Abstract: The International Workshop on Applications of Probability and Statistics to Biology (APSB) was successfully held in Shanghai, China, July 11-13, 2019. The workshop was hosted by the Institute of Science and Technology for Brain-inspired Intelligence (ISTBI) at Fudan University, in honor of the 80th birthday of Prof. Minping Qian of Peking University. Most of the twenty-eight speakers were former students or close collaborators of Prof. Qian, and there were over eighty participants from all over China and the United States.
Funding: The research was supported by the U.S. National Institutes of Health (R01GM120624), the National Science Foundation (DMS-1518001), the National Natural Science Foundation of China (11701546), and the Simons Collaboration on Computational Biogeochemical Modeling of Marine Ecosystems (CBIOMES, grant ID 549943). We thank Drs. Michael S. Waterman, Gesine Reinert, Ying Wang, Rui Jiang, Yang Lu, Lizzie Dorfman, Mr. Weili Wang, and Mr. Luigi Manna for helpful discussions and suggestions. We thank the USC Center for High Performance Computing (HPC) for helping us use their cluster computers.
Abstract: Background: The recent development of metagenomic sequencing makes it possible to massively sequence microbial genomes, including viral genomes, without the need for laboratory culture. Existing reference-based and gene homology-based methods are not efficient in identifying unknown viruses or short viral sequences from metagenomic data. Methods: Here we developed a reference-free and alignment-free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning. Results: Trained on sequences from viral RefSeq discovered before May 2015 and evaluated on those discovered after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROC 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences, respectively. Enlarging the training data with additional millions of purified viral sequences from metavirome samples further improved the accuracy for identifying virus groups that are under-represented. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were found to be associated with the cancer status, suggesting viruses may play important roles in CRC. Conclusions: Powered by deep learning and high-throughput sequencing metagenomic data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.
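The AUROC figures quoted in this abstract can be read as the probability that a randomly chosen viral contig receives a higher score than a randomly chosen non-viral one. The sketch below illustrates that definition only; it is not DeepVirFinder's evaluation code, and the function name `auroc` and the toy labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """Pairwise-comparison AUROC: the fraction of (positive, negative)
    pairs in which the positive has the higher score, ties counting 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = viral contig, 0 = non-viral contig.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
value = auroc(labels, scores)
```

The quadratic pairwise loop is chosen for clarity; a rank-based computation gives the same value and scales better on large contig sets.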
Abstract: The Human Genome Project (HGP) is a historic, landmark scientific project. In spite of initial controversy, it has become a bedrock foundation for much progress in biological science and human health. After the Human Genome Project was completed in the early 2000s, next-generation sequencing technologies were developed and have revolutionized genomics. Here is a brief account of the May 1985 meeting at the University of California, Santa Cruz. Historical accounts often begin with the Department of Energy (DOE) meeting in Santa Fe in March 1986 and neglect to include the Santa Cruz meeting [1], although sometimes it is discussed [2].
Funding: Supported by NSFC grants (Nos. 11571349 and 91630314), the National Key R&D Program of China under Grant 2018YFB0704304, NCMIS of CAS, LSC of CAS, and the Youth Innovation Promotion Association of CAS. JR and FS were supported by the US National Science Foundation (NSF) (DMS-1518001) and the National Institutes of Health (NIH) (R01GM120624, 1R01GM131407).
Abstract: Background: Markov chains (MC) have been widely used to model molecular sequences. The estimation of the MC transition matrix and of confidence intervals for the transition probabilities from long sequence data has been intensively studied in the past decades. In next-generation sequencing (NGS), a large number of short reads are generated. These short reads can overlap, and some regions of the genome may not be sequenced, resulting in a new type of data. Based on NGS data, the transition probabilities of MC can be estimated by moment estimators. However, the classical asymptotic distribution theory for MC transition probability estimators based on long sequences is no longer valid. Methods: In this study, we present the asymptotic distributions of several statistics related to MC based on NGS data. We show that, after scaling by the effective coverage d defined in a previous study by the authors, these statistics based on NGS data approximately follow the same distributions as the corresponding statistics for long sequences. Results: We apply the asymptotic properties of these statistics to find the theoretical confidence regions for MC transition probabilities based on NGS short-read data. We validate our theoretical confidence intervals using both simulated and real data sets, and compare the results with those from the parametric bootstrap method. Conclusions: We find that the asymptotic distributions of these statistics and the theoretical confidence intervals of transition probabilities based on NGS data given in this study are highly accurate, providing a powerful tool for NGS data analysis.
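As a rough illustration of estimating transition probabilities from short reads, the sketch below pools adjacent-letter counts over all reads and normalizes each row. The abstract does not give the exact form of its moment estimators, so this count-ratio estimator for a first-order chain is an assumption, as are the function name `transition_estimates` and the toy reads. It does not compute the effective coverage d or any confidence interval.

```python
from collections import Counter

ALPHABET = "ACGT"


def transition_estimates(reads, alphabet=ALPHABET):
    """Estimate first-order transition probabilities from short reads.

    Adjacent-pair counts are pooled across reads (assumed estimator:
    count of 'ab' divided by the count of all pairs starting with 'a').
    Pairs containing letters outside the alphabet, such as N, are skipped.
    """
    pair_counts = Counter()
    for read in reads:
        for a, b in zip(read, read[1:]):
            if a in alphabet and b in alphabet:
                pair_counts[(a, b)] += 1

    estimates = {}
    for a in alphabet:
        row_total = sum(pair_counts[(a, b)] for b in alphabet)
        # Rows with no observed transitions are left as zeros.
        estimates[a] = {
            b: (pair_counts[(a, b)] / row_total if row_total else 0.0)
            for b in alphabet
        }
    return estimates


# Hypothetical toy reads; real NGS reads overlap and leave gaps in coverage.
reads = ["ACGTAC", "CGTACG", "TACGTA"]
p_hat = transition_estimates(reads)
```

Because overlapping reads count the same genomic positions more than once, the pooled counts are not independent observations, which is why, per the abstract, the classical long-sequence asymptotics do not apply without the effective-coverage scaling.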