Funding: Deanship of Scientific Research (DSR), King Abdulaziz University, Grant/Award Number: D-139-137-1441.
Abstract: Owing to recent advances in technology, molecular databases have grown exponentially, creating demand for faster and more efficient methods that can handle such large volumes of data. Multi-processing CPU technology, including both physical and logical processors (Hyper Threading), can therefore be used to significantly increase computational performance. Sequence comparison and pairwise alignment both contribute significantly to calculating the resemblance between sequences when constructing optimal alignments. This research used the Hash Table-NGram-Hirschberg (HT-NGH) algorithm, which exploits hashing, to perform the pairwise alignment. The authors propose a parallel shared-memory architecture based on Hyper Threading to improve the performance of protein pairwise alignment on molecular datasets. The proposed parallel hyper-threading method adapts HT-NGH by decomposing the datasets at the sequence level so that the processing units are used efficiently, that is, so that idle processing units are reduced. Combining hyper threading with multicore processing on shared memory yielded an average performance gain of 24.8%, with 34.4% as the highest rate. These improvements were obtained while preserving acceptable accuracy, reaching speedups of 2.08, 2.88, and 3.87 and efficiencies of 1.04, 0.96, and 0.97 using 2, 3, and 4 cores, respectively.
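The efficiency figures quoted in this abstract are consistent with the usual definition of parallel efficiency as speedup divided by the number of cores. The short check below assumes that definition (the abstract does not state it), and the helper name `parallel_efficiency` is hypothetical.

```python
# Assumption: efficiency = speedup / number of cores (standard definition,
# not stated explicitly in the abstract).
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Return the parallel efficiency for a given speedup and core count."""
    return speedup / cores

# Speedups quoted in the abstract, keyed by number of cores.
reported_speedup = {2: 2.08, 3: 2.88, 4: 3.87}

# Round to two decimals, matching the precision used in the abstract.
efficiencies = {
    cores: round(parallel_efficiency(speedup, cores), 2)
    for cores, speedup in reported_speedup.items()
}
print(efficiencies)
```

Under this definition, an efficiency above 1 (as reported for 2 cores) corresponds to a slightly superlinear speedup.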
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 31822029 and 32200517) and the National Key R&D Project Program of China (Grant No. 2019YFE0109600).
Abstract: Increasing the accuracy of nucleotide sequence alignment is an essential issue in genomics research. Although classic dynamic programming (DP) algorithms (e.g., Smith–Waterman and Needleman–Wunsch) are guaranteed to produce the optimal result, their time complexity hinders their application to large-scale sequence alignment. Efforts to accelerate the alignment process generally come from three perspectives: redesigning data structures [e.g., diagonal or striped Single Instruction Multiple Data (SIMD) implementations], increasing the degree of parallelism in SIMD operations (e.g., difference recurrence relation), or reducing the search space (e.g., banded DP). However, no existing method combines all three aspects to build an ultra-fast algorithm. In this study, we developed the Banded Striped Aligner (BSAlign) library, which delivers accurate alignment results at ultra-fast speed by combining a series of novel methods that take advantage of all three perspectives, with highlights such as an active F-loop in striped vectorization and a striped move in banded DP. We applied the new acceleration design to both regular and edit-distance pairwise alignment. BSAlign achieved a 2-fold speed-up over other SIMD-based implementations for regular pairwise alignment, and a 1.5-fold to 4-fold speed-up in edit distance-based implementations for long reads. BSAlign is implemented in the C programming language and is available at https://github.com/ruanjue/bsalign.
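To illustrate what "reducing the search space (e.g., banded DP)" means, here is a minimal sketch of a banded edit-distance computation. It is not BSAlign's implementation: it uses no SIMD, no striped layout, and no adaptive band movement, and the function name and `band` parameter are hypothetical. It only fills DP cells whose row and column indices differ by at most `band`.

```python
from typing import Optional

def banded_edit_distance(a: str, b: str, band: int) -> Optional[int]:
    """Edit distance computed only over cells with |i - j| <= band.

    Returns None when the end cell lies outside the band. The result is
    exact whenever the true edit distance is at most `band`; otherwise it
    is an upper bound.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    inf = n + m + 1  # larger than any reachable distance
    # Row 0: only columns inside the band are reachable.
    prev = [j if j <= band else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - band), min(m, i + band)
        if lo == 0:
            cur[0] = i  # i deletions to reach column 0
        for j in range(max(1, lo), hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # deletion from a
                         cur[j - 1] + 1)      # insertion into a
        prev = cur
    return prev[m]

distance = banded_edit_distance("kitten", "sitting", band=3)
```

With a band of width `band`, only about `(2 * band + 1) * len(a)` cells are evaluated instead of `len(a) * len(b)`. For simplicity this sketch still allocates full-length rows; a tighter implementation would store only the band.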
Funding: Supported by the National Natural Science Foundation of China (30971454, 9030318, and 90208018)
Abstract: Identification of splice sites is a critical and difficult issue in eukaryotic genome annotation. Here, a statistical study is introduced for detecting the splicing signals in human hemoglobin (Hb) pre-mRNAs using regional pairwise alignment, splicing weight matrix scoring, and dynamic extended folding. First, the regional pairwise alignment results show that the coding regions of the human Hb genes exhibit high levels of both conservation and fluctuation. Second, the weight matrix scoring results indicate that, although the authentic splicing motifs always score the highest in a sequence, the sequence motif alone is inadequate to precisely define the splice sites. Finally, we deduce the RNA frame structures by applying an extended folding approach to analyze the stable folding elements. We find that the splice sequences tend to adopt stretched and partially paired conformations, which benefit recognition and competitive binding by the splicing factors. These results indicate that precise splicing is an integrated effect of multiple mechanisms of signal recognition at the levels of sequence and structure.
Abstract: The study of nucleotide substitution is very important both to our understanding of gene evolution and to the reliable estimation of phylogenetic relationships. In this paper, nucleotide substitution is assumed to be random, and a Markov model is applied to the study of gene evolution. A nonlinear optimization approach is then proposed for estimating substitution in real sequences. The resulting substitution estimate is called the 'Nucleotide State Transfer Matrix'. One of the most important conclusions of this work is that gene sequence evolution conforms to a Markov process. In addition, some theoretical evidence for random evolution is given from an energy analysis of DNA replication.
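As a rough illustration of treating nucleotide substitution as a Markov process, the sketch below propagates a base distribution at a single site through a 4×4 row-stochastic transition matrix. The matrix values are invented for illustration and are not the paper's estimated 'Nucleotide State Transfer Matrix'; the function names are hypothetical, and the paper's nonlinear optimization step is not shown.

```python
BASES = "ACGT"

def step(dist, matrix):
    """One Markov step: new[j] = sum over i of dist[i] * matrix[i][j]."""
    return [sum(dist[i] * matrix[i][j] for i in range(4)) for j in range(4)]

def evolve(dist, matrix, generations):
    """Apply the transition matrix repeatedly to a base distribution."""
    for _ in range(generations):
        dist = step(dist, matrix)
    return dist

# Hypothetical transition matrix: each base is kept with probability 0.97
# and replaced by each of the other three bases with probability 0.01.
P = [[0.97 if i == j else 0.01 for j in range(4)] for i in range(4)]

start = [1.0, 0.0, 0.0, 0.0]  # the site is initially 'A' with certainty
after_one = evolve(start, P, 1)
after_many = evolve(start, P, 500)

for base, prob in zip(BASES, after_many):
    print(base, round(prob, 4))
```

Because this example matrix is symmetric, repeated application drives the distribution toward equal base frequencies; a matrix estimated from real sequences would generally have a different stationary distribution.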