期刊文献+
共找到59篇文章
< 1 2 3 >
每页显示 20 50 100
Adaptive feature selection method for high-dimensional imbalanced data classification
1
作者 WU Jianzhen XUE Zhen +1 位作者 ZHANG Liangliang YANG Xu 《Journal of Measurement Science and Instrumentation》 2025年第4期612-624,共13页
Data collected in fields such as cybersecurity and biomedicine often encounter high dimensionality and class imbalance.To address the problem of low classification accuracy for minority class samples arising from nume... Data collected in fields such as cybersecurity and biomedicine often encounter high dimensionality and class imbalance.To address the problem of low classification accuracy for minority class samples arising from numerous irrelevant and redundant features in high-dimensional imbalanced data,we proposed a novel feature selection method named AMF-SGSK based on adaptive multi-filter and subspace-based gaining sharing knowledge.Firstly,the balanced dataset was obtained by random under-sampling.Secondly,combining the feature importance score with the AUC score for each filter method,we proposed a concept called feature hardness to judge the importance of feature,which could adaptively select the essential features.Finally,the optimal feature subset was obtained by gaining sharing knowledge in multiple subspaces.This approach effectively achieved dimensionality reduction for high-dimensional imbalanced data.The experiment results on 30 benchmark imbalanced datasets showed that AMF-SGSK performed better than other eight commonly used algorithms including BGWO and IG-SSO in terms of F1-score,AUC,and G-mean.The mean values of F1-score,AUC,and Gmean for AMF-SGSK are 0.950,0.967,and 0.965,respectively,achieving the highest among all algorithms.And the mean value of Gmean is higher than those of IG-PSO,ReliefF-GWO,and BGOA by 3.72%,11.12%,and 20.06%,respectively.Furthermore,the selected feature ratio is below 0.01 across the selected ten datasets,further demonstrating the proposed method’s overall superiority over competing approaches.AMF-SGSK could adaptively remove irrelevant and redundant features and effectively improve the classification accuracy of high-dimensional imbalanced data,providing scientific and technological references for practical applications. 展开更多
关键词 high-dimensional imbalanced data adaptive feature selection adaptive multi-filter feature hardness gaining sharing knowledge based algorithm metaheuristic algorithm
在线阅读 下载PDF
Similarity measurement method of high-dimensional data based on normalized net lattice subspace 被引量:4
2
作者 李文法 Wang Gongming +1 位作者 Li Ke Huang Su 《High Technology Letters》 EI CAS 2017年第2期179-184,共6页
The performance of conventional similarity measurement methods is affected seriously by the curse of dimensionality of high-dimensional data.The reason is that data difference between sparse and noisy dimensionalities... The performance of conventional similarity measurement methods is affected seriously by the curse of dimensionality of high-dimensional data.The reason is that data difference between sparse and noisy dimensionalities occupies a large proportion of the similarity,leading to the dissimilarities between any results.A similarity measurement method of high-dimensional data based on normalized net lattice subspace is proposed.The data range of each dimension is divided into several intervals,and the components in different dimensions are mapped onto the corresponding interval.Only the component in the same or adjacent interval is used to calculate the similarity.To validate this method,three data types are used,and seven common similarity measurement methods are compared.The experimental result indicates that the relative difference of the method is increasing with the dimensionality and is approximately two or three orders of magnitude higher than the conventional method.In addition,the similarity range of this method in different dimensions is [0,1],which is fit for similarity analysis after dimensionality reduction. 展开更多
关键词 high-dimensional data the curse of dimensionality SIMILARITY NORMALIZATION SUBSPACE NPsim
在线阅读 下载PDF
Generalized Functional Linear Models:Efficient Modeling for High-dimensional Correlated Mixture Exposures
3
作者 Bingsong Zhang Haibin Yu +11 位作者 Xin Peng Haiyi Yan Siran Li Shutong Luo Renhuizi Wei Zhujiang Zhou Yalin Kuang Yihuan Zheng Chulan Ou Linhua Liu Yuehua Hu Jindong Ni 《Biomedical and Environmental Sciences》 2025年第8期961-976,共16页
Objective Humans are exposed to complex mixtures of environmental chemicals and other factors that can affect their health.Analysis of these mixture exposures presents several key challenges for environmental epidemio... Objective Humans are exposed to complex mixtures of environmental chemicals and other factors that can affect their health.Analysis of these mixture exposures presents several key challenges for environmental epidemiology and risk assessment,including high dimensionality,correlated exposure,and subtle individual effects.Methods We proposed a novel statistical approach,the generalized functional linear model(GFLM),to analyze the health effects of exposure mixtures.GFLM treats the effect of mixture exposures as a smooth function by reordering exposures based on specific mechanisms and capturing internal correlations to provide a meaningful estimation and interpretation.The robustness and efficiency was evaluated under various scenarios through extensive simulation studies.Results We applied the GFLM to two datasets from the National Health and Nutrition Examination Survey(NHANES).In the first application,we examined the effects of 37 nutrients on BMI(2011–2016 cycles).The GFLM identified a significant mixture effect,with fiber and fat emerging as the nutrients with the greatest negative and positive effects on BMI,respectively.For the second application,we investigated the association between four pre-and perfluoroalkyl substances(PFAS)and gout risk(2007–2018 cycles).Unlike traditional methods,the GFLM indicated no significant association,demonstrating its robustness to multicollinearity.Conclusion GFLM framework is a powerful tool for mixture exposure analysis,offering improved handling of correlated exposures and interpretable results.It demonstrates robust performance across various scenarios and real-world applications,advancing our understanding of complex environmental exposures and their health impacts on environmental epidemiology and toxicology. 展开更多
关键词 Mixture exposure modeling Functional data analysis high-dimensional data Correlated exposures Environmental epidemiology
暂未订购
A nearest neighbor search algorithm of high-dimensional data based on sequential NPsim matrix
4
作者 李文法 Wang Gongming +1 位作者 Ma Nan Liu Hongzhe 《High Technology Letters》 EI CAS 2016年第3期241-247,共7页
Problems existin similarity measurement and index tree construction which affect the performance of nearest neighbor search of high-dimensional data. The equidistance problem is solved using NPsim function to calculat... Problems existin similarity measurement and index tree construction which affect the performance of nearest neighbor search of high-dimensional data. The equidistance problem is solved using NPsim function to calculate similarity. And a sequential NPsim matrix is built to improve indexing performance. To sum up the above innovations,a nearest neighbor search algorithm of high-dimensional data based on sequential NPsim matrix is proposed in comparison with the nearest neighbor search algorithms based on KD-tree or SR-tree on Munsell spectral data set. Experimental results show that the proposed algorithm similarity is better than that of other algorithms and searching speed is more than thousands times of others. In addition,the slow construction speed of sequential NPsim matrix can be increased by using parallel computing. 展开更多
关键词 nearest neighbor search high-dimensional data SIMILARITY indexing tree NPsim KD-TREE SR-tree Munsell
在线阅读 下载PDF
A State-Migration Particle Swarm Optimizer for Adaptive Latent Factor Analysis of High-Dimensional and Incomplete Data
5
作者 Jiufang Chen Kechen Liu +4 位作者 Xin Luo Ye Yuan Khaled Sedraoui Yusuf Al-Turki MengChu Zhou 《IEEE/CAA Journal of Automatica Sinica》 SCIE EI CSCD 2024年第11期2220-2235,共16页
High-dimensional and incomplete(HDI) matrices are primarily generated in all kinds of big-data-related practical applications. A latent factor analysis(LFA) model is capable of conducting efficient representation lear... High-dimensional and incomplete(HDI) matrices are primarily generated in all kinds of big-data-related practical applications. A latent factor analysis(LFA) model is capable of conducting efficient representation learning to an HDI matrix,whose hyper-parameter adaptation can be implemented through a particle swarm optimizer(PSO) to meet scalable requirements.However, conventional PSO is limited by its premature issues,which leads to the accuracy loss of a resultant LFA model. To address this thorny issue, this study merges the information of each particle's state migration into its evolution process following the principle of a generalized momentum method for improving its search ability, thereby building a state-migration particle swarm optimizer(SPSO), whose theoretical convergence is rigorously proved in this study. It is then incorporated into an LFA model for implementing efficient hyper-parameter adaptation without accuracy loss. Experiments on six HDI matrices indicate that an SPSO-incorporated LFA model outperforms state-of-the-art LFA models in terms of prediction accuracy for missing data of an HDI matrix with competitive computational efficiency.Hence, SPSO's use ensures efficient and reliable hyper-parameter adaptation in an LFA model, thus ensuring practicality and accurate representation learning for HDI matrices. 展开更多
关键词 data science generalized momentum high-dimensional and incomplete(HDI)data hyper-parameter adaptation latent factor analysis(LFA) particle swarm optimization(PSO)
在线阅读 下载PDF
Censored Composite Conditional Quantile Screening for High-Dimensional Survival Data
6
作者 LIU Wei LI Yingqiu 《应用概率统计》 CSCD 北大核心 2024年第5期783-799,共17页
In this paper,we introduce the censored composite conditional quantile coefficient(cC-CQC)to rank the relative importance of each predictor in high-dimensional censored regression.The cCCQC takes advantage of all usef... In this paper,we introduce the censored composite conditional quantile coefficient(cC-CQC)to rank the relative importance of each predictor in high-dimensional censored regression.The cCCQC takes advantage of all useful information across quantiles and can detect nonlinear effects including interactions and heterogeneity,effectively.Furthermore,the proposed screening method based on cCCQC is robust to the existence of outliers and enjoys the sure screening property.Simulation results demonstrate that the proposed method performs competitively on survival datasets of high-dimensional predictors,particularly when the variables are highly correlated. 展开更多
关键词 high-dimensional survival data censored composite conditional quantile coefficient sure screening property rank consistency property
在线阅读 下载PDF
Dimensionality Reduction of High-Dimensional Highly Correlated Multivariate Grapevine Dataset
7
作者 Uday Kant Jha Peter Bajorski +3 位作者 Ernest Fokoue Justine Vanden Heuvel Jan van Aardt Grant Anderson 《Open Journal of Statistics》 2017年第4期702-717,共16页
Viticulturists traditionally have a keen interest in studying the relationship between the biochemistry of grapevines’ leaves/petioles and their associated spectral reflectance in order to understand the fruit ripeni... Viticulturists traditionally have a keen interest in studying the relationship between the biochemistry of grapevines’ leaves/petioles and their associated spectral reflectance in order to understand the fruit ripening rate, water status, nutrient levels, and disease risk. In this paper, we implement imaging spectroscopy (hyperspectral) reflectance data, for the reflective 330 - 2510 nm wavelength region (986 total spectral bands), to assess vineyard nutrient status;this constitutes a high dimensional dataset with a covariance matrix that is ill-conditioned. The identification of the variables (wavelength bands) that contribute useful information for nutrient assessment and prediction, plays a pivotal role in multivariate statistical modeling. In recent years, researchers have successfully developed many continuous, nearly unbiased, sparse and accurate variable selection methods to overcome this problem. This paper compares four regularized and one functional regression methods: Elastic Net, Multi-Step Adaptive Elastic Net, Minimax Concave Penalty, iterative Sure Independence Screening, and Functional Data Analysis for wavelength variable selection. Thereafter, the predictive performance of these regularized sparse models is enhanced using the stepwise regression. This comparative study of regression methods using a high-dimensional and highly correlated grapevine hyperspectral dataset revealed that the performance of Elastic Net for variable selection yields the best predictive ability. 展开更多
关键词 high-dimensional data MULTI-STEP Adaptive Elastic Net MINIMAX CONCAVE Penalty Sure Independence Screening Functional data Analysis
暂未订购
Optimal Estimation of High-Dimensional Covariance Matrices with Missing and Noisy Data
8
作者 Meiyin Wang Wanzhou Ye 《Advances in Pure Mathematics》 2024年第4期214-227,共14页
The estimation of covariance matrices is very important in many fields, such as statistics. In real applications, data are frequently influenced by high dimensions and noise. However, most relevant studies are based o... The estimation of covariance matrices is very important in many fields, such as statistics. In real applications, data are frequently influenced by high dimensions and noise. However, most relevant studies are based on complete data. This paper studies the optimal estimation of high-dimensional covariance matrices based on missing and noisy sample under the norm. First, the model with sub-Gaussian additive noise is presented. The generalized sample covariance is then modified to define a hard thresholding estimator , and the minimax upper bound is derived. After that, the minimax lower bound is derived, and it is concluded that the estimator presented in this article is rate-optimal. Finally, numerical simulation analysis is performed. The result shows that for missing samples with sub-Gaussian noise, if the true covariance matrix is sparse, the hard thresholding estimator outperforms the traditional estimate method. 展开更多
关键词 high-dimensional Covariance Matrix Missing data Sub-Gaussian Noise Optimal Estimation
在线阅读 下载PDF
Making Short-term High-dimensional Data Predictable
9
作者 CHEN Luonan 《Bulletin of the Chinese Academy of Sciences》 2018年第4期243-244,共2页
Making accurate forecast or prediction is a challenging task in the big data era, in particular for those datasets involving high-dimensional variables but short-term time series points,which are generally available f... Making accurate forecast or prediction is a challenging task in the big data era, in particular for those datasets involving high-dimensional variables but short-term time series points,which are generally available from real-world systems.To address this issue, Prof. 展开更多
关键词 RDE MAKING SHORT-TERM high-dimensional data Predictable
在线阅读 下载PDF
Randomized Latent Factor Model for High-dimensional and Sparse Matrices from Industrial Applications 被引量:14
10
作者 Mingsheng Shang Xin Luo +3 位作者 Zhigang Liu Jia Chen Ye Yuan MengChu Zhou 《IEEE/CAA Journal of Automatica Sinica》 EI CSCD 2019年第1期131-141,共11页
Latent factor(LF)models are highly effective in extracting useful knowledge from High-Dimensional and Sparse(HiDS)matrices which are commonly seen in various industrial applications.An LF model usually adopts iterativ... Latent factor(LF)models are highly effective in extracting useful knowledge from High-Dimensional and Sparse(HiDS)matrices which are commonly seen in various industrial applications.An LF model usually adopts iterative optimizers,which may consume many iterations to achieve a local optima,resulting in considerable time cost.Hence,determining how to accelerate the training process for LF models has become a significant issue.To address this,this work proposes a randomized latent factor(RLF)model.It incorporates the principle of randomized learning techniques from neural networks into the LF analysis of HiDS matrices,thereby greatly alleviating computational burden.It also extends a standard learning process for randomized neural networks in context of LF analysis to make the resulting model represent an HiDS matrix correctly.Experimental results on three HiDS matrices from industrial applications demonstrate that compared with state-of-the-art LF models,RLF is able to achieve significantly higher computational efficiency and comparable prediction accuracy for missing data.I provides an important alternative approach to LF analysis of HiDS matrices,which is especially desired for industrial applications demanding highly efficient models. 展开更多
关键词 Big data high-dimensional and sparse matrix latent factor analysis latent factor model randomized learning
在线阅读 下载PDF
CSFW-SC: Cuckoo Search Fuzzy-Weighting Algorithm for Subspace Clustering Applying to High-Dimensional Clustering 被引量:1
11
作者 WANG Jindong HE Jiajing +1 位作者 ZHANG Hengwei YU Zhiyong 《China Communications》 SCIE CSCD 2015年第S2期55-63,共9页
Aimed at the issue that traditional clustering methods are not appropriate to high-dimensional data, a cuckoo search fuzzy-weighting algorithm for subspace clustering is presented on the basis of the exited soft subsp... Aimed at the issue that traditional clustering methods are not appropriate to high-dimensional data, a cuckoo search fuzzy-weighting algorithm for subspace clustering is presented on the basis of the exited soft subspace clustering algorithm. In the proposed algorithm, a novel objective function is firstly designed by considering the fuzzy weighting within-cluster compactness and the between-cluster separation, and loosening the constraints of dimension weight matrix. Then gradual membership and improved Cuckoo search, a global search strategy, are introduced to optimize the objective function and search subspace clusters, giving novel learning rules for clustering. At last, the performance of the proposed algorithm on the clustering analysis of various low and high dimensional datasets is experimentally compared with that of several competitive subspace clustering algorithms. Experimental studies demonstrate that the proposed algorithm can obtain better performance than most of the existing soft subspace clustering algorithms. 展开更多
关键词 high-dimensional data CLUSTERING soft SUBSPACE CUCKOO SEARCH FUZZY CLUSTERING
在线阅读 下载PDF
A Length-Adaptive Non-Dominated Sorting Genetic Algorithm for Bi-Objective High-Dimensional Feature Selection 被引量:1
12
作者 Yanlu Gong Junhai Zhou +2 位作者 Quanwang Wu MengChu Zhou Junhao Wen 《IEEE/CAA Journal of Automatica Sinica》 SCIE EI CSCD 2023年第9期1834-1844,共11页
As a crucial data preprocessing method in data mining,feature selection(FS)can be regarded as a bi-objective optimization problem that aims to maximize classification accuracy and minimize the number of selected featu... As a crucial data preprocessing method in data mining,feature selection(FS)can be regarded as a bi-objective optimization problem that aims to maximize classification accuracy and minimize the number of selected features.Evolutionary computing(EC)is promising for FS owing to its powerful search capability.However,in traditional EC-based methods,feature subsets are represented via a length-fixed individual encoding.It is ineffective for high-dimensional data,because it results in a huge search space and prohibitive training time.This work proposes a length-adaptive non-dominated sorting genetic algorithm(LA-NSGA)with a length-variable individual encoding and a length-adaptive evolution mechanism for bi-objective highdimensional FS.In LA-NSGA,an initialization method based on correlation and redundancy is devised to initialize individuals of diverse lengths,and a Pareto dominance-based length change operator is introduced to guide individuals to explore in promising search space adaptively.Moreover,a dominance-based local search method is employed for further improvement.The experimental results based on 12 high-dimensional gene datasets show that the Pareto front of feature subsets produced by LA-NSGA is superior to those of existing algorithms. 展开更多
关键词 Bi-objective optimization feature selection(FS) genetic algorithm high-dimensional data length-adaptive
在线阅读 下载PDF
Observation points classifier ensemble for high-dimensional imbalanced classification 被引量:1
13
作者 Yulin He Xu Li +3 位作者 Philippe Fournier‐Viger Joshua Zhexue Huang Mianjie Li Salman Salloum 《CAAI Transactions on Intelligence Technology》 SCIE EI 2023年第2期500-517,共18页
In this paper,an Observation Points Classifier Ensemble(OPCE)algorithm is proposed to deal with High-Dimensional Imbalanced Classification(HDIC)problems based on data processed using the Multi-Dimensional Scaling(MDS)... In this paper,an Observation Points Classifier Ensemble(OPCE)algorithm is proposed to deal with High-Dimensional Imbalanced Classification(HDIC)problems based on data processed using the Multi-Dimensional Scaling(MDS)feature extraction technique.First,dimensionality of the original imbalanced data is reduced using MDS so that distances between any two different samples are preserved as well as possible.Second,a novel OPCE algorithm is applied to classify imbalanced samples by placing optimised observation points in a low-dimensional data space.Third,optimization of the observation point mappings is carried out to obtain a reliable assessment of the unknown samples.Exhaustive experiments have been conducted to evaluate the feasibility,rationality,and effectiveness of the proposed OPCE algorithm using seven benchmark HDIC data sets.Experimental results show that(1)the OPCE algorithm can be trained faster on low-dimensional imbalanced data than on high-dimensional data;(2)the OPCE algorithm can correctly identify samples as the number of optimised observation points is increased;and(3)statistical analysis reveals that OPCE yields better HDIC performances on the selected data sets in comparison with eight other HDIC algorithms.This demonstrates that OPCE is a viable algorithm to deal with HDIC problems. 展开更多
关键词 classifier ensemble feature transformation high-dimensional data classification imbalanced learning observation point mechanism
在线阅读 下载PDF
Variance Estimation for High-Dimensional Varying Index Coefficient Models
14
作者 Miao Wang Hao Lv Yicun Wang 《Open Journal of Statistics》 2019年第5期555-570,共16页
This paper studies the re-adjusted cross-validation method and a semiparametric regression model called the varying index coefficient model. We use the profile spline modal estimator method to estimate the coefficient... This paper studies the re-adjusted cross-validation method and a semiparametric regression model called the varying index coefficient model. We use the profile spline modal estimator method to estimate the coefficients of the parameter part of the Varying Index Coefficient Model (VICM), while the unknown function part uses the B-spline to expand. Moreover, we combine the above two estimation methods under the assumption of high-dimensional data. The results of data simulation and empirical analysis show that for the varying index coefficient model, the re-adjusted cross-validation method is better in terms of accuracy and stability than traditional methods based on ordinary least squares. 展开更多
关键词 high-dimensional data Refitted Cross-Validation VARYING INDEX COEFFICIENT MODELS Variance ESTIMATION
在线阅读 下载PDF
A Test on High-Dimensional Intraclass Correlation Structure
15
作者 TANG Ping XIAO Nan-nan XIE Jun-shan 《Chinese Quarterly Journal of Mathematics》 2022年第1期10-25,共16页
The paper considers a high-dimensional likelihood ratio(LR)test on the intraclass correlation structure of the multivariate normal population.When the dimension p and sample size N satisfy N−1>p→∞,it is proved th... The paper considers a high-dimensional likelihood ratio(LR)test on the intraclass correlation structure of the multivariate normal population.When the dimension p and sample size N satisfy N−1>p→∞,it is proved that the logarithmic LR statistic asymptotically obeys Gaussian distribution,and the explicit expressions of the mean and the variance are also obtained.The simulations demonstrate that our high-dimensional LR test method outperforms the traditional Chi-square approximation method or F-approximation method,and performs as efficient as the accurate high-dimensional Edgeworth expansion method and the more accurate high-dimensional Edgeworth expansion method in analyzing the intraclass covariance structure of highdimensional data. 展开更多
关键词 Likelihood ratio test high-dimensional data Intraclass correlation structure
在线阅读 下载PDF
Inference for High-Dimensional Streamed Longitudinal Data
16
作者 Senyuan Zheng Ling Zhou 《Acta Mathematica Sinica,English Series》 2025年第2期757-779,共23页
With the advent of modern devices,such as smartphones and wearable devices,high-dimensional data are collected on many participants for a period of time or even in perpetuity.For this type of data,dependencies between... With the advent of modern devices,such as smartphones and wearable devices,high-dimensional data are collected on many participants for a period of time or even in perpetuity.For this type of data,dependencies between and within data batches exist because data are collected from the same individual over time.Under the framework of streamed data,individual historical data are not available due to the storage and computation burden.It is urgent to develop computationally efficient methods with statistical guarantees to analyze high-dimensional streamed data and make reliable inferences in practice.In addition,the homogeneity assumption on the model parameters may not be valid in practice over time.To address the above issues,in this paper,we develop a new renewable debiased-lasso inference method for high-dimensional streamed data allowing dependences between and within data batches to exist and model parameters to gradually change.We establish the large sample properties of the proposed estimators,including consistency and asymptotic normality.The numerical results,including simulations and real data analysis,show the superior performance of the proposed method. 展开更多
关键词 Debiased lasso high-dimensional inference streamed longitudinal data renewable inference
原文传递
High-dimensional large-scale mixed-type data imputation under missing at random
17
作者 Wei Liu Guizhen Li +1 位作者 Ling Zhou Lan Luo 《Science China Mathematics》 2025年第4期969-1000,共32页
Missingness in mixed-type variables is commonly encountered in a variety of areas.The requirement of complete observations necessitates data imputation when a moderate or large proportion of data is missing.However,in... Missingness in mixed-type variables is commonly encountered in a variety of areas.The requirement of complete observations necessitates data imputation when a moderate or large proportion of data is missing.However,inappropriate imputation would downgrade the performance of machine learning algorithms,leading to bad predictions and unreliable statistical inference.For high-dimensional large-scale mixed-type missing data,we develop a computationally efficient imputation method,missing value imputation via generalized factor models(MIG),under missing at random.The proposed MIG method allows missing variables to be of different types,including continuous,binary,and count variables,and are scalable to both data size n and variable dimension p while existing imputation methods rely on restrictive assumptions such as the same type of missing variables,the low dimensionality of variables,and a limited sample size.We explicitly show that the imputation error of the proposed MIG method diminishes to zero with the rate Op(max{n^(-1/2),p^(-1/2)})as both n and p tend to infinity.Five real datasets demonstrate the superior empirical performance of the proposed MIG method over existing methods that the average normalized absolute imputation error is reduced by 5.3%–34.1%. 展开更多
关键词 IMPUTATION high-dimensional mixed-type data missing at random generalized factor model
原文传递
Effects of subsampling on characteristics of RNA-seq data from triple-negative breast cancer patients
18
作者 Alexey Stupnikov Galina V Glazko Frank Emmert-Streib 《Chinese Journal of Cancer》 SCIE CAS CSCD 2015年第10期427-438,共12页
Background:Data from RNA-seq experiments provide a wealth of information about the transcriptome of an organism.However,the analysis of such data is very demanding.In this study,we aimed to establish robust analysis p... Background:Data from RNA-seq experiments provide a wealth of information about the transcriptome of an organism.However,the analysis of such data is very demanding.In this study,we aimed to establish robust analysis procedures that can be used in clinical practice.Methods:We studied RNA-seq data from triple-negative breast cancer patients.Specifically,we investigated the subsampling of RNA-seq data.Results:The main results of our investigations are as follows:(1) the subsampling of RNA-seq data gave biologically realistic simulations of sequencing experiments with smaller sequencing depth but not direct scaling of count matrices;(2) the saturation of results required an average sequencing depth larger than 32 million reads and an individual sequencing depth larger than 46 million reads;and(3) for an abrogated feature selection,higher moments of the distribution of all expressed genes had a higher sensitivity for signal detection than the corresponding mean values.Conclusions:Our results reveal important characteristics of RNA-seq data that must be understood before one can apply such an approach to translational medicine. 展开更多
关键词 RNA-SEQ data Computational genomics Statistical robustness high-dimensional biology Triple-negative breast cancer
暂未订购
Data-Aware Partitioning Schema in MapReduce
19
作者 Liang Junjie Liu Qiongni +1 位作者 Yin Li Yu Dunhui 《国际计算机前沿大会会议论文集》 2015年第1期28-29,共2页
With the advantages of MapReduce programming model in parallel computing and processing of data and tasks on large-scale clusters, a Dataaware partitioning schema in MapReduce for large-scale high-dimensional data is ... With the advantages of MapReduce programming model in parallel computing and processing of data and tasks on large-scale clusters, a Dataaware partitioning schema in MapReduce for large-scale high-dimensional data is proposed. It optimizes partition method of data blocks with the same contribution to computation in MapReduce. Using a two-stage data partitioning strategy, the data are uniformly distributed into data blocks by clustering and partitioning. The experiments show that the data-aware partitioning schema is very effective and extensible for improving the query efficiency of highdimensional data. 展开更多
关键词 CLOUD COMPUTING MAPREDUCE high-dimensional data dataaware partitioning
在线阅读 下载PDF
上一页 1 2 3 下一页 到第
使用帮助 返回顶部