Abstract: In protein sequence classification, a variable-length protein sequence is commonly converted into a fixed-length numerical vector using descriptors such as k-mer composition. Such position-independent descriptors are useful because they apply to sequences of any length; however, they discard the positional information of subsequences, even though that information may contribute substantially to classification performance. To address this problem, we divided the original sequence into several segments and calculated numerical features for each of them. This partially introduces positional information (for instance, the serine composition in the anterior and posterior segments of a sequence). Through comprehensive experiments on the number of segments and the length of the overlapping region, we found that our classification approach, which combines sequence segmentation with feature selection, is effective in improving performance. We evaluated the approach on three protein classification problems and achieved significant improvement in all cases where the dataset had a sufficient number of amino acids in each sequence. These results show the potential of using additional segments in protein sequence classification and suggest that the approach may help with other sequence problems in bioinformatics.
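The segmentation idea described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`kmer_composition`, `split_segments`, `segmented_features`), the equal-length splitting rule, the way the overlap extends each segment on both sides, and the choice to prepend the whole-sequence descriptor are all assumptions made for illustration. Feature selection, which the abstract also mentions, is not shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(seq, k=1):
    """Return normalized k-mer frequencies in a fixed (lexicographic) order."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # k-mers with non-standard residues are skipped
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def split_segments(seq, n_segments, overlap=0):
    """Split seq into n_segments roughly equal parts.

    Assumption: each part is extended by `overlap` residues on both
    sides, clipped at the ends of the sequence.
    """
    n = len(seq)
    bounds = [(i * n) // n_segments for i in range(n_segments + 1)]
    return [
        seq[max(0, bounds[i] - overlap):min(n, bounds[i + 1] + overlap)]
        for i in range(n_segments)
    ]


def segmented_features(seq, n_segments=2, overlap=0, k=1):
    """Concatenate the whole-sequence descriptor with one per segment."""
    features = kmer_composition(seq, k)
    for segment in split_segments(seq, n_segments, overlap):
        features.extend(kmer_composition(segment, k))
    return features
```

With two segments and k = 1, every sequence maps to a vector of length 60 regardless of its own length, and the serine frequency of the anterior and posterior halves appears as two separate entries, which is the kind of partial positional information the abstract refers to.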
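Abstract: Based on the chaos game representation (CGR) of DNA sequences, this paper presents a new 3D spatial representation of protein sequences. The x and y coordinates of the space curve are obtained with the method from Randic's article, and the z coordinate is obtained from the cumulative count of amino acids in the protein sequence. The curve produced by this 3D representation is easier to visualize than Randic's 2D graphical representation. The curves were characterized numerically using several mathematical methods, and on the basis of these numerical characterizations the similarity of mitochondrial NADH dehydrogenase from 9 different species was compared.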
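A rough Python sketch of how such a 3D curve might be built is given below. The abstract does not spell out the construction, so several details are assumptions: the 20 residues are placed at the vertices of a regular 20-gon on the unit circle in alphabetical order of their one-letter codes, each step moves halfway from the current point toward the vertex of the next residue (the usual CGR rule), and z is read as the running count of the current residue type. The function name `cgr_3d_curve` is hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: residues sit on the unit circle, alphabetical by one-letter code.
VERTICES = {
    aa: (math.cos(2 * math.pi * i / 20), math.sin(2 * math.pi * i / 20))
    for i, aa in enumerate(AMINO_ACIDS)
}


def cgr_3d_curve(seq):
    """Return a list of (x, y, z) points, one per standard residue in seq."""
    x, y = 0.0, 0.0  # start at the center of the circle
    seen = {}        # running count of each residue type
    points = []
    for aa in seq:
        if aa not in VERTICES:
            continue  # skip non-standard residues
        vx, vy = VERTICES[aa]
        # CGR step: move halfway toward the vertex of the current residue
        x, y = (x + vx) / 2, (y + vy) / 2
        # Assumed reading of z: how often this residue has occurred so far
        seen[aa] = seen.get(aa, 0) + 1
        points.append((x, y, seen[aa]))
    return points
```

The numerical characterization and the similarity comparison mentioned in the abstract are not covered by this sketch.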