Funding: This study was funded by Agriculture and Agri-Food Canada (Project AAFC J0000–75).
Abstract: The Genome Analysis Toolkit (GATK) is a popular set of programs for discovering and genotyping variants from next-generation sequencing data. The current GATK recommendation for RNA sequencing (RNA-seq) is to perform variant calling on individual samples, with the drawback that only variable positions are reported. Versions 3.0 and above of GATK offer the possibility of calling DNA variants on cohorts of samples using the HaplotypeCaller algorithm in Genomic Variant Call Format (GVCF) mode. With this approach, variants are called individually on each sample, generating one GVCF file per sample that lists genotype likelihoods and their genome annotations. In a second step, variants are called from the GVCF files through a joint genotyping analysis. This strategy is more flexible and reduces computational challenges in comparison to the traditional joint discovery workflow. Using a GVCF workflow for mining SNPs in RNA-seq data provides substantial advantages, including reporting homozygous genotypes for the reference allele as well as missing data. Taking advantage of RNA-seq data derived from primary macrophages isolated from 50 cows, the GATK joint genotyping method for calling variants on RNA-seq data was validated by comparing it to a so-called "per-sample" method. In addition, pairwise comparisons of the two methods were performed to evaluate their respective sensitivity, precision and accuracy using DNA genotypes from a companion study that included the same 50 cows, genotyped either by genotyping-by-sequencing or with the Bovine SNP50 BeadChip (imputed to the Bovine high density). Results indicate that both approaches are very close in their capacity to detect reference variants and that the joint genotyping method is more sensitive than the per-sample method. Given that the joint genotyping method is more flexible and technically easier, we recommend this approach for variant calling in RNA-seq experiments.
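The two-step GVCF workflow described in this abstract (per-sample HaplotypeCaller in GVCF mode, then joint genotyping) can be sketched as a small command builder. This is an illustration only, under stated assumptions: it uses GATK4-style command-line syntax (`gatk HaplotypeCaller -ERC GVCF`, `gatk CombineGVCFs`, `gatk GenotypeGVCFs`), whereas the study refers to GATK 3.0 and above and may have used a different invocation. The function names and all file names are hypothetical, and RNA-seq-specific preprocessing of the alignments is not covered. The functions only build argument lists; they do not run GATK.

```python
from typing import List


def haplotypecaller_gvcf_cmd(reference: str, bam: str, gvcf: str) -> List[str]:
    """Step 1: per-sample calling in GVCF mode (one GVCF per sample).

    Assumes GATK4 syntax; file paths are hypothetical placeholders.
    """
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,   # reference genome FASTA
        "-I", bam,         # aligned reads for one sample
        "-O", gvcf,        # per-sample GVCF output
        "-ERC", "GVCF",    # emit reference confidence, i.e. GVCF mode
    ]


def joint_genotyping_cmds(
    reference: str, gvcfs: List[str], combined: str, output: str
) -> List[List[str]]:
    """Step 2: merge the per-sample GVCFs, then genotype the cohort jointly."""
    combine = ["gatk", "CombineGVCFs", "-R", reference]
    for gvcf in gvcfs:
        combine += ["--variant", gvcf]
    combine += ["-O", combined]

    genotype = [
        "gatk", "GenotypeGVCFs",
        "-R", reference,
        "-V", combined,
        "-O", output,
    ]
    return [combine, genotype]


if __name__ == "__main__":
    samples = ["cow01", "cow02"]  # hypothetical sample names
    for name in samples:
        cmd = haplotypecaller_gvcf_cmd("ref.fa", f"{name}.bam", f"{name}.g.vcf.gz")
        print(" ".join(cmd))
    gvcf_files = [f"{name}.g.vcf.gz" for name in samples]
    for cmd in joint_genotyping_cmds("ref.fa", gvcf_files, "cohort.g.vcf.gz", "cohort.vcf.gz"):
        print(" ".join(cmd))
```

Because step 1 is independent per sample, a new sample only needs its own GVCF before the joint genotyping step is rerun, which is one way to read the flexibility the abstract attributes to this workflow.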
Abstract: The slow speed of next-generation sequencing (NGS) data analysis with the current industry-standard genome analysis pipeline, relative to the output of the latest high-throughput sequencers such as the HiSeq X system, has been the major cause of the data backlog that limits the real-time use of genomic data for precision medicine. This study demonstrates the DRAGEN Bio-IT Processor as a potential candidate for removing the "Big Data Bottleneck". DRAGEN™ completed variant calling for ~40× coverage WGS data in as little as ~30 minutes using a single command, achieving an over 50-fold increase in analysis speed while maintaining similar or better variant calling accuracy than the standard GATK Best Practices workflow. This systematic comparison offers a faster and more efficient NGS data analysis alternative to NGS-based healthcare industries and research institutes seeking to meet the requirements of precision-medicine-based healthcare.
Funding: This work was supported by the Zhejiang Provincial Natural Science Foundation (LY20H030006), the Key Research & Development Program of Zhejiang (2023C03045), the Fundamental Research Funds for the Central Universities (2022ZFJH003), the Jinan Microecological Biomedicine Shandong Laboratory (JNL-2022036C), and the Public Welfare Project of Jinhua City, Zhejiang (2021-4-359).
Abstract: Bacterial genome sequencing is a powerful technique for studying the genetic diversity and evolution of microbial populations. However, detecting genomic variants from sequencing data is challenging because of contamination, sequencing errors and the presence of multiple strains within the same species. Several bioinformatics tools have been developed to address these issues, but their performance and accuracy have not been systematically evaluated. In this study, we compared 10 variant detection pipelines using 18 simulated and 17 real datasets of high-throughput sequences from a set of representative bacteria. We assessed the sensitivity of each pipeline under different conditions of coverage, simulation and strain diversity. We also demonstrated the application of these tools to identify consistent mutations in a dataset of Staphylococcus hominis sequenced 30 times. We found that HaplotypeCaller, but not Mutect2, from the GATK tool set showed the best performance in terms of accuracy and robustness. CFSAN and Snippy performed less well on several simulated and real sequencing datasets. Our results provide a comprehensive benchmark and guidance for choosing the optimal variant detection pipeline for high-throughput bacterial genome sequencing data.
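For simulated datasets, where the true variants are known, the sensitivity of a pipeline can be computed by comparing its calls against the truth set. The sketch below is a minimal illustration and not the authors' evaluation code: the abstract does not say how variants were matched, so exact matching on a hypothetical `(contig, position, ref, alt)` tuple is an assumption, as are the function name and the example variants. Precision is included alongside sensitivity for context.

```python
from typing import Dict, Set, Tuple

# Hypothetical variant key: (contig, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def sensitivity_precision(truth: Set[Variant], called: Set[Variant]) -> Dict[str, float]:
    """Compare a pipeline's calls with a known truth set by exact match."""
    tp = len(truth & called)   # true positives: real variants that were called
    fn = len(truth - called)   # false negatives: real variants that were missed
    fp = len(called - truth)   # false positives: calls not in the truth set

    # Guard against empty sets so the ratios are always defined.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return {"tp": tp, "fn": fn, "fp": fp,
            "sensitivity": sensitivity, "precision": precision}


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
             ("chr1", 400, "G", "A"), ("chr1", 900, "T", "C")}
    called = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
              ("chr1", 400, "G", "A"), ("chr1", 555, "A", "T")}
    print(sensitivity_precision(truth, called))
```

In this made-up example, three of the four true variants are recovered and one call is spurious, so both sensitivity and precision are 0.75.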