As a less time-consuming procedure, subsampling technology has been widely used in biological monitoring and assessment programs. It is clear that subsampling counts affect the value of traditional biodiversity indices, but their effect on taxonomic distinctness (TD) indices is less well studied. Here, we examined the responses of traditional (species richness, Shannon-Wiener diversity) and TD (average taxonomic distinctness: Δ+, and variation in taxonomic distinctness: Λ+) indices to subsample counts using a random subsampling procedure from 50 to 400 individuals, based on macroinvertebrate datasets from three different river systems in China. At the regional scale, taxa richness increased asymptotically with fixed-count size; ≥250–300 individuals were needed to express 95% of the information in the raw data. In contrast, TD indices were less sensitive to the subsampling procedure. At the local scale, TD indices were more stable and had less deviation than species richness and the Shannon-Wiener index, even at low subsample counts, with ≥100 individuals needed to estimate 95% of the information of the actual Δ+ and Λ+ in the three river basins. We also found that abundance had a certain effect on diversity indices during the subsampling procedure, with the subsampling counts required for species richness and TD indices varying by region. Therefore, we suggest that TD indices are suitable for biodiversity assessment and environmental monitoring. Meanwhile, pilot analyses are necessary to determine the appropriate subsample counts for bioassessment in a new region or habitat type.
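The fixed-count procedure described above amounts to rarefaction of the raw abundance data. Below is a minimal sketch of one such random subsampling step (Python; `counts`, the per-taxon abundances at a site, is a hypothetical name, and this is not the authors' code; the TD indices additionally need a taxonomic tree and are not reproduced here):

```python
import numpy as np

def diversity_at_fixed_count(counts, n_individuals, rng):
    """Randomly draw n_individuals without replacement, then compute
    species richness and Shannon-Wiener H' on the subsample."""
    individuals = np.repeat(np.arange(len(counts)), counts)   # one entry per individual
    picked = rng.choice(individuals, size=n_individuals, replace=False)
    _, sub_counts = np.unique(picked, return_counts=True)
    p = sub_counts / sub_counts.sum()
    return len(sub_counts), float(-np.sum(p * np.log(p)))

rng = np.random.default_rng(0)
counts = np.array([120, 60, 30, 15, 8, 4, 2, 1])               # toy abundances for one site
for n in (50, 100, 200):
    richness, shannon = diversity_at_fixed_count(counts, n, rng)
    print(n, richness, round(shannon, 3))
```

Repeating this over many replicates and fixed-count sizes gives the kind of index-versus-count curves the study compares.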
Conventional full-waveform inversion is computationally intensive because it considers all shots in each iteration. To tackle this, we establish the number of shots needed and propose multiscale inversion in the frequency domain while using only the shots that are positively correlated with frequency. When using low-frequency data, the method considers only a small number of shots and raw data. More shots are used with increasing frequency. The random-in-group subsampling method is used to rotate the shots between iterations and avoid the loss of shot information. By reducing the number of shots in the inversion, we decrease the computational cost. There is no crosstalk between shots, no noise addition, and no observational limits. Numerical modeling suggests that the proposed method reduces the computing time, is more robust to noise, and produces better velocity models when using data with noise.
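A minimal sketch of random-in-group shot selection is given below (hypothetical variable names; the mapping from frequency to group count is an assumption, not the paper's rule): the shot indices are split into groups and one shot is drawn from each group per iteration, so the selected shots rotate across iterations instead of being discarded outright.

```python
import numpy as np

def random_in_group_shots(n_shots, n_groups, rng):
    """Split the shot indices into n_groups groups and draw one shot per group."""
    groups = np.array_split(np.arange(n_shots), n_groups)
    return np.array([rng.choice(group) for group in groups])

rng = np.random.default_rng(0)
n_shots = 96
for iteration, freq_hz in enumerate([3.0, 3.0, 6.0, 6.0, 12.0]):
    n_groups = min(int(8 * freq_hz), n_shots)   # assumed rule: more shots at higher frequency
    shots = random_in_group_shots(n_shots, n_groups, rng)
    print(iteration, freq_hz, len(shots))
```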
We propose a subsampling method for robust estimation of regression models which is built on classical methods such as the least squares method. It makes use of the non-robust nature of the underlying classical method to find a good sample from regression data contaminated with outliers, and then applies the classical method to the good sample to produce robust estimates of the regression model parameters. The subsampling method is a computational method rooted in the bootstrap methodology which trades analytical treatment for intensive computation; it finds the good sample through repeated fitting of the regression model to many random subsamples of the contaminated data instead of through an analytical treatment of the outliers. The subsampling method can be applied to all regression models for which non-robust classical methods are available. In the present paper, we focus on the basic formulation and robustness property of the subsampling method that are valid for all regression models. We also discuss variations of the method and apply it to three examples involving three different regression models.
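A simplified sketch of the idea follows (not the authors' exact procedure; the subsample size, number of repetitions, and inlier quantile are illustrative): fit ordinary least squares to many random subsamples, score each fit on the full contaminated data with a robust criterion, then refit on the points the best fit identifies as a good sample.

```python
import numpy as np

def subsample_ls_regression(X, y, n_sub=20, n_rep=500, inlier_quantile=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])                  # design matrix with intercept
    best_beta, best_score = None, np.inf
    for _ in range(n_rep):
        idx = rng.choice(n, size=n_sub, replace=False)     # random subsample
        beta, *_ = np.linalg.lstsq(Xd[idx], y[idx], rcond=None)
        score = np.median(np.abs(y - Xd @ beta))           # robust criterion on the full data
        if score < best_score:
            best_score, best_beta = score, beta
    resid = np.abs(y - Xd @ best_beta)
    good = resid <= np.quantile(resid, inlier_quantile)    # the "good sample"
    beta_final, *_ = np.linalg.lstsq(Xd[good], y[good], rcond=None)
    return beta_final
```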
A new, faster block-matching algorithm (BMA) using both search-candidate and pixel subsamplings is proposed. First, a pixel-subsampling approach used in adjustable partial distortion search (APDS) is adjusted to visit about half the points of all search candidates by subsampling them, using a spiral scanning path with one skip. Two selected candidates that have the minimal and second-minimal block distortion measures are obtained. Then a fine-tuning step is taken around them to find the best one. Some analyses are given to justify the rationality of the proposed approach. Experimental results show that, compared to APDS, the proposed algorithm can increase the block-matching speed by about 30% while keeping its MSE performance very close to that of APDS. It also performs much better than many other BMAs such as TSS, NTSS, UCDBS and NPDS.
Subsampling plays a crucial role in enhancing the efficiency of Markov chain Monte Carlo (MCMC) algorithms. This paper presents a subsampling-based MCMC algorithm aimed at addressing the computational complexity challenges of traditional MCMC methods on large-scale datasets. The proposed approach significantly reduces computational costs by approximating the full-data likelihood function using only a subset of the full data in each iteration. The subsampling process is guided by the fidelity to the full data, which is measured by the energy distance. The resulting algorithm, termed energy distance-based subsampling MCMC (EDSS-MCMC), offers a flexible approach while maintaining the simplicity of the standard MCMC algorithm. Additionally, we provide an analysis of the invariant distribution generated by the EDSS-MCMC algorithm and quantify the total variation norm between this distribution and the target distribution. Numerical experiments demonstrate the outstanding performance of the proposed algorithm on large-scale datasets. Compared with the standard MCMC algorithm and other subsampling MCMC algorithms, the EDSS-MCMC algorithm exhibits advantages in terms of accuracy and computational speed. Therefore, the proposed algorithm holds practical significance in tasks involving large-scale dataset analysis and machine learning.
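The fidelity measure here is the energy distance between a subsample and the full data. A minimal, O(n²) sketch of using it to pick a subsample is shown below (hypothetical names, small data only; the paper embeds this guidance inside the MCMC iterations rather than as a one-off search):

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(x, y):
    """Energy distance between two samples (rows are observations)."""
    return 2 * cdist(x, y).mean() - cdist(x, x).mean() - cdist(y, y).mean()

def pick_subsample_by_energy_distance(data, m, n_candidates=20, seed=0):
    rng = np.random.default_rng(seed)
    best_idx, best_d = None, np.inf
    for _ in range(n_candidates):
        idx = rng.choice(len(data), size=m, replace=False)
        d = energy_distance(data[idx], data)      # fidelity of this candidate subsample
        if d < best_d:
            best_d, best_idx = d, idx
    return best_idx

data = np.random.default_rng(1).normal(size=(500, 3))
idx = pick_subsample_by_energy_distance(data, m=50)
print(len(idx))
```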
A microwave photonic subsampling digital receiver (MPSDR) is proposed and experimentally demonstrated for target detection with a sampling rate of 10 MSa/s. Stepped and pseudo-random frequency-hopping signals with frequencies across the K band are both used for target detection and can be captured by the MPSDR. The range profiles of the targets are then derived using a compressed sensing algorithm, and precise target position estimation is achieved by changing the measurement position of the antenna pair. The results demonstrate that the estimation accuracy remains comparable even when the pseudo-random frequency-hopping signal utilizes only 12.5% of the frequency points required by the stepped frequency-hopping signal. This highlights the efficiency and potential of the proposed MPSDR in processing complex signals while maintaining high accuracy.
In this paper, we consider unified optimal subsampling estimation and inference on the low-dimensional parameter of main interest in the presence of a nuisance parameter for low/high-dimensional generalized linear models (GLMs) with massive data. We first present a general subsampling decorrelated score function to reduce the influence of the less accurate nuisance parameter estimation with its slow convergence rate. The consistency and asymptotic normality of the resultant subsample estimator from a general decorrelated score subsampling algorithm are established, and two optimal subsampling probabilities are derived under the A- and L-optimality criteria to downsize the data volume and reduce the computational burden. The proposed optimal subsampling probabilities provably improve the asymptotic efficiency of the subsampling schemes in low-dimensional GLMs and perform better than the uniform subsampling scheme in high-dimensional GLMs. A two-step algorithm is further proposed for implementation, and the asymptotic properties of the corresponding estimators are also given. Simulations show satisfactory performance of the proposed estimators, and two applications to census income and Fashion-MNIST datasets also demonstrate its practical applicability.
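For intuition, here is a minimal sketch of how pilot-based subsampling probabilities of this general flavor are built in the binary logistic special case (this is the familiar |y_i − p_i|·||x_i|| form from the optimal-subsampling literature, not the paper's decorrelated-score construction, which additionally handles the nuisance parameter):

```python
import numpy as np

def pilot_based_probs(X, y, beta_pilot):
    """Subsampling probabilities proportional to |y_i - p_i| * ||x_i||,
    computed from a pilot estimate of the logistic regression coefficients."""
    p = 1.0 / (1.0 + np.exp(-X @ beta_pilot))      # pilot predicted probabilities
    scores = np.abs(y - p) * np.linalg.norm(X, axis=1)
    return scores / scores.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
beta_true = np.array([1.0, -0.5, 0.25])
y = (rng.random(1000) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)
probs = pilot_based_probs(X, y, beta_pilot=np.zeros(3))   # crude pilot, for illustration only
subsample = rng.choice(1000, size=100, replace=True, p=probs)
```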
The dissemination of news is a vital topic in management science, social science and data science. With the development of technology, the sample sizes and dimensions of digital news data increase remarkably. To alleviate the computational burden in big data, this paper proposes a method to deal with massive and moderate-dimensional data for linear regression models by combining model-averaging and subsampling methodologies. The author first draws a subsample from the full data according to some special probabilities and splits the covariates into several groups to construct candidate models. Then, the author solves each candidate model and calculates the model-averaging weights to combine these estimators based on this subsample. Additionally, the asymptotic optimality in subsampling form is proved and the way to calculate optimal subsampling probabilities is provided. The author also illustrates the proposed method via simulations, which show that it takes less running time than using the full data and generates more accurate estimates than uniform subsampling. Finally, the author applies the proposed method to analyze and predict the sharing number of news, and finds that the topic, vocabulary and dissemination time are the determinants.
Statistical machine learning models should be evaluated and validated before being put to work. The conventional k-fold Monte Carlo cross-validation (MCCV) procedure uses a pseudo-random sequence to partition instances into k subsets, which usually causes subsampling bias, inflates generalization errors and jeopardizes the reliability and effectiveness of cross-validation. Based on ordered systematic sampling theory in statistics and low-discrepancy sequence theory in number theory, we propose a new k-fold cross-validation procedure that replaces the pseudo-random sequence with a best-discrepancy sequence, which ensures low subsampling bias and leads to more precise expected-prediction-error (EPE) estimates. Experiments with 156 benchmark datasets and three classifiers (logistic regression, decision tree and naïve Bayes) show that, in general, our cross-validation procedure can reduce the subsampling bias of the MCCV, lowering the EPE by around 7.18% and the variances by around 26.73%. In comparison, the stratified MCCV can reduce the EPE and variances of the MCCV by around 1.58% and 11.85%, respectively. Leave-one-out (LOO) can lower the EPE by around 2.50%, but its variances are much higher than those of any other cross-validation (CV) procedure. The computational time of our cross-validation procedure is just 8.64% of the MCCV, 8.67% of the stratified MCCV and 16.72% of the LOO. Experiments also show that our approach is more beneficial for datasets characterized by relatively small size and large aspect ratio. This makes our approach particularly pertinent when solving bioscience classification problems. Our proposed systematic subsampling technique could be generalized to other machine learning algorithms that involve a random subsampling mechanism.
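A minimal sketch of the fold-assignment idea follows (a van der Corput sequence stands in for the best-discrepancy sequence actually used, and in practice the instances would typically be ordered first so that the systematic assignment is meaningful):

```python
import numpy as np

def van_der_corput(n, base=2):
    """First n terms of the base-b van der Corput low-discrepancy sequence."""
    seq = np.empty(n)
    for i in range(n):
        q, denom, x = i + 1, 1.0, 0.0
        while q > 0:
            denom *= base
            q, r = divmod(q, base)
            x += r / denom
        seq[i] = x
    return seq

def low_discrepancy_folds(n_instances, k):
    """Assign each instance a fold label in {0, ..., k-1} from a low-discrepancy sequence."""
    return np.floor(van_der_corput(n_instances) * k).astype(int)

folds = low_discrepancy_folds(20, 5)
print(folds)
print(np.bincount(folds, minlength=5))   # fold sizes come out close to balanced
```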
Softmax regression, which is also called multinomial logistic regression, is widely used in various fields for modeling the relationship between covariates and categorical responses with multiple levels. The increasing volumes of data bring new challenges for parameter estimation in softmax regression, and the optimal subsampling method is an effective way to address them. However, optimal subsampling with replacement requires access to all the sampling probabilities simultaneously to draw a subsample, and the resultant subsample can contain duplicate observations. In this paper, the authors consider Poisson subsampling for its higher estimation accuracy and its applicability in the scenario where the data exceed the memory limit. The authors derive the asymptotic properties of the general Poisson subsampling estimator and obtain optimal subsampling probabilities by minimizing the asymptotic variance-covariance matrix under both A- and L-optimality criteria. The optimal subsampling probabilities contain unknown quantities from the full dataset, so the authors propose an approximately optimal Poisson subsampling algorithm with two sampling steps, the first of which serves as a pilot phase. The authors demonstrate the performance of the optimal Poisson subsampling algorithm through numerical simulations and real data examples.
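A minimal sketch of the Poisson subsampling step itself (hypothetical names; the probabilities `pi` would come from the A-/L-optimality formulas via the pilot step and are taken as given here): each observation is kept independently with probability min(1, r·pi_i) and reweighted by its inverse inclusion probability, so no duplicates can arise and the data can be scanned one pass at a time.

```python
import numpy as np

def poisson_subsample(pi, r, seed=0):
    """Keep observation i with probability min(1, r * pi[i]); return kept
    indices and their inverse-probability weights."""
    rng = np.random.default_rng(seed)
    p_keep = np.minimum(1.0, r * pi)
    keep = rng.random(len(pi)) < p_keep
    idx = np.flatnonzero(keep)
    return idx, 1.0 / p_keep[idx]

pi = np.full(10_000, 1 / 10_000)          # uniform pilot probabilities, summing to 1
idx, weights = poisson_subsample(pi, r=500)
print(len(idx))                            # expected subsample size is about r = 500
```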
Accurate and efficient predictions of the quasiparticle properties of complex materials remain a major challenge due to the convergence issue and the unfavorable scaling of the computational cost with respect to the system size. Quasiparticle GW calculations for two-dimensional (2D) materials are especially difficult. The unusual analytical behaviors of the dielectric screening and the electron self-energy of 2D materials make the conventional Brillouin zone (BZ) integration approach rather inefficient and require an extremely dense k-grid to properly converge the calculated quasiparticle energies. In this work, we present a combined nonuniform subsampling and analytical integration method that can drastically improve the efficiency of the BZ integration in 2D GW calculations.
Dropping fractions of users or items judiciously can reduce the computational cost of Collaborative Filtering (CF) algorithms. The effect of this subsampling on the computing time and accuracy of CF is not fully understood, and clear guidelines for selecting optimal or even appropriate subsampling levels are not available. In this paper, we present a Density-based Random Stratified Subsampling using Clustering (DRSC) algorithm in which the desired Fraction of Users Dropped (FUD) and Fraction of Items Dropped (FID) are specified, and the overall density during subsampling is maintained. Subsequently, we develop simple models of the Training Time Improvement (TTI) and the Accuracy Loss (AL) as functions of FUD and FID, based on extensive simulations of seven standard CF algorithms applied to various primary matrices from MovieLens, Yahoo Music Rating, and Amazon Automotive data. Simulations show that both TTI and a scaled AL are bi-linear in FID and FUD for all seven methods. The TTI linear regression of a CF method appears to be the same for all datasets. Extensive simulations illustrate that TTI can be estimated reliably from FUD and FID alone, but AL requires considering additional dataset characteristics. The derived models are then used to optimize the levels of subsampling, addressing the tradeoff between TTI and AL. A simple sub-optimal approximation was found, in which the optimal AL is proportional to the optimal Training Time Reduction Factor (TTRF) for higher values of TTRF, and the optimal subsampling levels, such as the optimal FID/(1-FID), are proportional to the square root of TTRF.
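A minimal sketch of applying given FUD and FID levels to a rating matrix is shown below (plain uniform dropping with hypothetical names; DRSC's density-preserving stratification via clustering is not reproduced here):

```python
import numpy as np

def drop_users_items(ratings, fud, fid, seed=0):
    """ratings: users x items array (NaN for missing entries). Drop a fraction
    fud of users and a fraction fid of items uniformly at random."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    keep_u = np.sort(rng.choice(n_users, size=round(n_users * (1 - fud)), replace=False))
    keep_i = np.sort(rng.choice(n_items, size=round(n_items * (1 - fid)), replace=False))
    return ratings[np.ix_(keep_u, keep_i)]

ratings = np.random.default_rng(1).integers(1, 6, size=(100, 50)).astype(float)
sub = drop_users_items(ratings, fud=0.3, fid=0.2)
print(sub.shape)                               # (70, 40)
```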
Many video surveillance applications rely on efficient motion detection. However, the algorithms are usually costly since they compute a background model at every pixel of the frame. This paper shows that, in the case of a planar scene with a fixed calibrated camera, a set of pixels can be selected to compute the background model while ignoring the other pixels for accurate but less costly motion detection. The calibration is used first to define a volume of interest in the real world and to project it onto the image, and then to define a spatially adaptive subsampling of this region of interest with a subsampling density that depends on the camera distance. Indeed, farther objects need to be analyzed with more precision than closer objects. Tests on many video sequences have integrated this adaptive subsampling into various motion detection techniques.
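A minimal sketch of a distance-dependent pixel mask follows (hypothetical names and constants; the paper derives the density from camera calibration of a planar scene rather than from a per-pixel distance map): grid spacing shrinks as the camera distance grows, so farther regions are sampled more densely.

```python
import numpy as np

def adaptive_subsampling_mask(distance_map, c=40.0, max_step=16):
    """Keep pixel (v, u) if it lies on a local grid whose spacing is inversely
    proportional to the camera distance at that pixel."""
    h, w = distance_map.shape
    mask = np.zeros((h, w), dtype=bool)
    for v in range(h):
        for u in range(w):
            step = int(np.clip(c / max(distance_map[v, u], 1e-6), 1, max_step))
            if v % step == 0 and u % step == 0:
                mask[v, u] = True
    return mask

distance = np.linspace(2.0, 40.0, 240 * 320).reshape(240, 320)   # toy distance map
mask = adaptive_subsampling_mask(distance)
print(mask.mean())        # fraction of pixels retained for background modeling
```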
We postulate and analyze a nonlinear subsampling accuracy loss (SSAL) model based on the root mean square error (RMSE) and two SSAL models based on the mean square error (MSE), suggested by extensive preliminary simulations. The SSAL models predict accuracy loss in terms of subsampling parameters such as the fraction of users dropped (FUD) and the fraction of items dropped (FID). We seek to investigate whether the models depend on the characteristics of the dataset in a constant way across datasets when using the SVD collaborative filtering (CF) algorithm. The dataset characteristics considered include various densities of the rating matrix and the numbers of users and items. Extensive simulations and rigorous regression analysis led to empirical symmetrical SSAL models in terms of FID and FUD whose coefficients depend only on the data characteristics. The SSAL models turned out to be multi-linear in terms of the odds ratios of dropping a user (or an item) vs. not dropping it. Moreover, one MSE deterioration model turned out to be linear in the FID and FUD odds, with a zero coefficient on their interaction term. Most importantly, the models are constant in the sense that they are written in closed form using the considered data characteristics (densities and numbers of users and items). The models are validated through extensive simulations based on 850 synthetically generated primary (pre-subsampling) matrices derived from the 25M MovieLens dataset. Nearly 460,000 subsampled rating matrices were then simulated and subjected to the singular value decomposition (SVD) CF algorithm. Further validation was conducted using the 1M MovieLens and the Yahoo! Music Rating datasets. The models were constant and significant across all three datasets.
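For concreteness, the general shape of such an odds-based model is sketched below (the coefficients and which terms survive are dataset- and model-specific; this is not the paper's fitted equation):

```latex
% Shape of an accuracy-loss model that is multi-linear in the user- and
% item-dropping odds (coefficients are illustrative placeholders):
\[
  \mathrm{AL} \;\approx\;
  \beta_0
  \;+\; \beta_1\,\frac{\mathrm{FUD}}{1-\mathrm{FUD}}
  \;+\; \beta_2\,\frac{\mathrm{FID}}{1-\mathrm{FID}}
  \;+\; \beta_3\,\frac{\mathrm{FUD}}{1-\mathrm{FUD}}\cdot\frac{\mathrm{FID}}{1-\mathrm{FID}},
\]
% with \beta_3 = 0 for the linear MSE-deterioration model mentioned above.
```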
This paper presents a selective review of statistical computation methods for massive data analysis. A huge number of statistical methods for massive data computation have been rapidly developed in the past decades. In this work, we focus on three categories of statistical computation methods: (1) distributed computing, (2) subsampling methods, and (3) minibatch gradient techniques. The first class of literature is about distributed computing and focuses on the situation where the dataset size is too huge to be comfortably handled by one single computer. In this case, a distributed computation system with multiple computers has to be utilized. The second class of literature is about subsampling methods and concerns the situation where the sample size of the dataset is small enough to be placed on one single computer but too large to be easily processed by its memory as a whole. The last class of literature studies minibatch gradient-related optimization techniques, which have been extensively used for optimizing various deep learning models.
Principal component analysis (PCA) is ubiquitous in the statistics and machine learning domains. It is frequently used as an intermediate procedure in various regression and classification problems to reduce the dimensionality of datasets. However, as the size of datasets becomes extremely large, direct application of PCA may not be feasible since loading and storing massive datasets may exceed the computational ability of common machines. To address this problem, subsampling is usually performed, in which a small proportion of the data is used as a surrogate of the entire dataset. This paper proposes an A-optimal subsampling algorithm to decrease the computational cost of PCA for super-large datasets. To be more specific, we establish the consistency and asymptotic normality of the eigenvectors of the subsampled covariance matrix. Subsequently, we derive the optimal subsampling probabilities for PCA based on the A-optimality criterion. We validate the theoretical results by conducting extensive simulation studies. Moreover, the proposed subsampling algorithm for PCA is embedded into a classification procedure for handwriting data to assess its effectiveness in real-world applications.
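A minimal sketch of the subsample-then-eigendecompose step is shown below (uniform probabilities are used as a placeholder; the paper's A-optimal probabilities require pilot quantities and are not reproduced here):

```python
import numpy as np

def subsampled_pca(X, r, probs=None, n_components=2, seed=0):
    """Draw r rows with the given probabilities, form an inverse-probability
    weighted covariance estimate, and return its leading eigenvectors."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    if probs is None:
        probs = np.full(n, 1.0 / n)                    # uniform placeholder probabilities
    idx = rng.choice(n, size=r, replace=True, p=probs)
    w = 1.0 / (n * r * probs[idx])                     # inverse-probability weights
    Xc = X[idx] - X[idx].mean(axis=0)
    cov = (Xc * w[:, None]).T @ Xc                     # weighted covariance estimate
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_components]]

X = np.random.default_rng(2).normal(size=(50_000, 10))
components = subsampled_pca(X, r=2_000)
print(components.shape)                                # (10, 2)
```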