Abstract: High-throughput SNP genotyping platforms use automated genotype calling algorithms to assign genotypes. While these algorithms work efficiently on their own platforms, they are not compatible with other platforms and have individual biases that result in missed genotype calls. Here we present data on the use of a second, complementary SNP genotype clustering algorithm. The algorithm was originally designed for individual fluorescent SNP genotyping assays and has been optimized to permit the clustering of large datasets generated from custom-designed Affymetrix SNP panels. In an analysis of data from a 3K array genotyped on 1,560 samples, the additional analysis increased the overall number of genotypes by over 45,000, significantly improving the completeness of the experimental data. This analysis suggests that the use of multiple genotype calling algorithms may be advisable in high-throughput SNP genotyping experiments. The software is written in Perl and is available from the corresponding author.
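The idea of rescuing missed calls with a second algorithm can be pictured as a simple merge: keep every genotype the primary caller assigned, and consult the secondary caller only where the primary made no call. The following is a minimal Python sketch under stated assumptions, not the authors' Perl software; the function name `merge_calls`, the `NO_CALL` marker, the sample and SNP identifiers, and the dictionary layout are all hypothetical.

```python
NO_CALL = None  # hypothetical marker for a missed genotype call


def merge_calls(primary, secondary):
    """Fill no-calls in `primary` with calls from `secondary`.

    Both arguments map (sample, snp) -> genotype string or NO_CALL.
    Calls made by the primary algorithm are never overwritten.
    Returns the merged calls and the number of genotypes rescued.
    """
    merged = dict(primary)
    rescued = 0
    for key, call in primary.items():
        # Only consult the second algorithm where the first one failed.
        if call is NO_CALL and secondary.get(key) is not NO_CALL:
            merged[key] = secondary[key]
            rescued += 1
    return merged, rescued


# Made-up example data: two samples, two SNPs.
primary = {("s1", "rs1"): "AA", ("s1", "rs2"): NO_CALL, ("s2", "rs1"): NO_CALL}
secondary = {("s1", "rs1"): "AB", ("s1", "rs2"): "BB", ("s2", "rs1"): NO_CALL}
merged, rescued = merge_calls(primary, secondary)
```

A real pipeline would also have to decide what to do when the two algorithms disagree on a call; this sketch simply trusts the primary caller in that case.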
Funding: Supported by the National Natural Science Foundation of China under Grant No. 60433020; the Program for New Century Excellent Talents in University of China under Grant No. NCET-05-0683; the Program for Changjiang Scholars and Innovative Research Team in University of China under Grant No. IRT0661; and the Scientific Research Fund of Hunan Provincial Education Department of China under Grant No. 06C52.
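Abstract: The individual haplotyping MSR (minimum SNP removal) problem asks how to use an individual's sequenced DNA fragment data to remove the fewest SNP (single-nucleotide polymorphism) sites so that the individual's haplotypes can be determined. For this problem, Bafna et al. proposed an algorithm with time complexity $O(2^k n^2 m)$, where $m$ is the total number of DNA fragments, $n$ is the total number of SNP sites, and $k$ is the number of holes in a fragment (a hole is a site within a fragment that has a null value). Because a single mate-pair fragment can contain as many as 100 holes, Bafna's algorithm is usually infeasible when the fragment data include mate-pairs. Based on the characteristics of fragment data, a new algorithm is proposed with time complexity $O((n-1)(k_1-1)k_2 2^{2h} + (k_1+1)^{2h} + n k_2 + m k_1)$, where $k_1$ is the maximum number of SNP sites covered by a single fragment (no more than $n$), $k_2$ is the maximum number of fragments covering the same SNP site (usually no more than 19), and $h$ is the maximum number of fragments that cover the same SNP site and take a null value at that site (no more than $k_2$). The time complexity of the new algorithm does not depend directly on $k$, the maximum number of holes in a fragment, so the algorithm remains efficient on fragment data containing mate-pairs. It scales well and is of considerable practical value.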
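To make the parameters in the complexity bounds concrete, the sketch below computes $k$, $k_1$, $k_2$ and $h$ for a small fragment set. It is a Python illustration only, not part of the published algorithm. It assumes that a fragment "covers" every site from its first to its last non-null entry and that a hole is a null entry inside that span; the string encoding (`'0'`/`'1'` for alleles, `'-'` for null) and the function name `fragment_parameters` are hypothetical.

```python
def fragment_parameters(fragments):
    """Compute (k, k1, k2, h) for a list of equal-length fragment strings.

    Each string has one character per SNP site: '0' or '1' for an allele
    and '-' for a null value. Assumption: a fragment covers every site from
    its first to its last non-null entry, and a hole is a null entry inside
    that span.
    """
    n = len(fragments[0]) if fragments else 0
    cover = [0] * n  # number of fragments covering each site
    nulls = [0] * n  # covering fragments that are null at the site
    k = k1 = 0
    for frag in fragments:
        called = [j for j, c in enumerate(frag) if c != "-"]
        if not called:
            continue  # an all-null fragment covers nothing
        first, last = called[0], called[-1]
        span = last - first + 1
        k = max(k, span - len(called))  # holes inside this fragment
        k1 = max(k1, span)              # sites covered by this fragment
        for j in range(first, last + 1):
            cover[j] += 1
            if frag[j] == "-":
                nulls[j] += 1
    return k, k1, max(cover, default=0), max(nulls, default=0)


# The first fragment mimics a mate-pair: two holes between its two reads.
params = fragment_parameters(["01--1", "-10--", "1-0--"])  # → (2, 5, 3, 1)
```

In this toy input the mate-pair-like fragment drives $k$ and $k_1$ up, while $h$ stays small because few fragments are null at any one covered site, which is the situation the new bound is designed to exploit.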
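Abstract: Based on the characteristics of DNA sequencing fragment data, a parameterized algorithm is proposed for the MEC/GI model of the haplotype assembly problem, with time complexity $O(n k_2 2^{k_2} + m \log m + m k_1)$, where $m$ is the number of fragments, $n$ is the number of SNP sites in the haplotype, $k_1$ is the maximum number of SNP sites covered by a single fragment (usually less than 10), and $k_2$ is the maximum number of fragments covering the same SNP site (usually no more than 10). For fragment data from real DNA sequencing, the algorithm can obtain an exact solution of the MEC/GI model in a short time even when $m$ and $n$ are both quite large. It scales well and is of considerable practical value.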
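For readers unfamiliar with the objective, the sketch below evaluates the standard minimum error correction (MEC) cost of a candidate haplotype pair and finds an optimum by exhaustive search over pairs compatible with a genotype. It is a brute-force Python illustration whose runtime is exponential in the number of heterozygous sites; it is not the parameterized algorithm described in the abstract. The genotype encoding (`'0'` or `'1'` for a homozygous site, `'h'` for a heterozygous one) and both function names are hypothetical.

```python
from itertools import product


def mec_cost(fragments, h1, h2):
    """Entries that must be corrected so every fragment matches h1 or h2.

    Fragments and haplotypes are sequences over '0'/'1'; '-' in a fragment
    is a null value and never counts as a conflict.
    """
    def conflicts(frag, hap):
        return sum(f != "-" and f != a for f, a in zip(frag, hap))

    # Each fragment is charged against whichever haplotype it fits better.
    return sum(min(conflicts(f, h1), conflicts(f, h2)) for f in fragments)


def brute_force_mec_gi(fragments, genotype):
    """Exhaustive search over haplotype pairs compatible with `genotype`.

    Homozygous sites are fixed in both haplotypes; each heterozygous site
    ('h') is tried in both phases. Only suitable for tiny examples.
    """
    het = [j for j, g in enumerate(genotype) if g == "h"]
    best = None
    for phase in product("01", repeat=len(het)):
        h1 = list(genotype)
        h2 = list(genotype)
        for j, allele in zip(het, phase):
            h1[j] = allele
            h2[j] = "1" if allele == "0" else "0"
        cost = mec_cost(fragments, h1, h2)
        if best is None or cost < best[0]:
            best = (cost, "".join(h1), "".join(h2))
    return best


# One entry of the second fragment must be corrected in the best solution.
best = brute_force_mec_gi(["010-", "-101", "0100", "1011"], "hhhh")
# → (1, '0100', '1011')
```

The exhaustive search makes the role of the genotype information visible: it removes every haplotype pair that disagrees with the genotype, so only the phase of the heterozygous sites remains to be decided.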
Funding: Project supported by the German Ministry of Education and Research (BMBF) through the National Genome Research Network (NGFN) (Nos. 01GS0486 and 01GR0460), and by the Deutsche Forschungsgemeinschaft (DFG) through a travel grant to Armin O. Schmitt.
Abstract: Copy number variants (CNVs) are pieces of genomic DNA of 1,000 base pairs or longer that occur in a given genome at a different frequency than in a reference genome. Their importance as a source of phenotypic variability has been recognized only in the last couple of years. Chromosomal deletions can be seen as a special case of CNVs in which stretches of DNA are missing in certain lines when compared to a reference genome, for example that of the mouse line C57BL/6. Based upon more than 8 million single nucleotide polymorphisms (SNPs) in the fifteen inbred mouse lines, which were determined in a whole-genome, chip-based resequencing project by Perlegen Sciences, we detected 20,166 such long chromosomal deletions. Altogether they cover between 4.4 million and 8.8 million base pairs, depending on the mouse line. Thus, their extent is comparable to that of SNPs. The chromosomal deletions were found by applying bioinformatics and biostatistical methods to search for clusters of missing values in the genotyping data. In contrast to isolated missing values, clusters are likely the consequence of missing DNA probe rather than of a failed hybridization or deficient oligos. We analyzed these deletion sites in various ways. Twenty-two percent of these deletion sites overlap with exons; they could therefore affect a gene's functioning. The corresponding genes seem to exist in alternative forms, a phenomenon that is reminiscent of the alternative forms of mRNA generated during gene splicing. We furthermore detected statistically significant associations between hundreds of deletion sites and fat weight at the age of eight weeks.
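The core step, searching for clusters of missing values along a chromosome, can be sketched as a scan for runs of consecutive missing genotype calls. The Python below is a minimal illustration under stated assumptions, not the authors' method, which the abstract does not specify. The function name `missing_runs`, the use of `None` for a missing value, and the thresholds `min_snps` and `min_span` are hypothetical; only the 1,000 bp minimum length echoes the CNV definition above.

```python
def missing_runs(positions, calls, min_snps=3, min_span=1000):
    """Return (start, end, n_snps) for runs of consecutive missing calls.

    `positions` are SNP coordinates in base pairs, sorted ascending;
    `calls` holds a genotype per SNP, with None for a missing value.
    `min_snps` and `min_span` are illustrative thresholds, not values
    taken from the study.
    """
    runs = []
    start = None  # index of the first SNP in the current run
    # A trailing sentinel closes a run that reaches the last SNP.
    for i, call in enumerate(list(calls) + ["sentinel"]):
        if call is None:
            if start is None:
                start = i
        elif start is not None:
            n_snps = i - start
            span = positions[i - 1] - positions[start] + 1
            if n_snps >= min_snps and span >= min_span:
                runs.append((positions[start], positions[i - 1], n_snps))
            start = None
    return runs


# Made-up coordinates: one long run of missing calls and one isolated one.
positions = [100, 600, 1200, 1900, 2500, 9000, 9100]
calls = ["A", None, None, None, "G", None, "T"]
candidates = missing_runs(positions, calls)  # → [(600, 1900, 3)]
```

The isolated missing value at position 9000 is ignored, mirroring the abstract's distinction between clusters of missing values and isolated ones that are more plausibly technical failures.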