Funding: Supported by grants from the National Key R&D Program of China (Grant No. 2021YFF1200400), the Major Program of Shenzhen Bay Laboratory, China (Grant No. S201101001), the Shenzhen Science and Technology Innovation Program, China (Grant No. KQTD20170330155106581), and the Griffith University Postgraduate Fellowship, Australia.
Abstract: sequences found in the huge, integrated database of protein sequences (Big Fantastic Database). In contrast, the existing nucleotide databases were not consolidated to facilitate wider and deeper homology search. Here, we built a comprehensive database by incorporating the non-coding RNA (ncRNA) sequences from RNAcentral, the transcriptome and metagenome assemblies from metagenomics RAST (MG-RAST), the genomic sequences from Genome Warehouse (GWH), and the genomic sequences from MGnify, in addition to the nucleotide (nt) database and its subsets at the National Center for Biotechnology Information (NCBI). The resulting Master database of All possible RNA sequences (MARS) is 20-fold larger than NCBI’s nt database and 60-fold larger than RNAcentral. The new dataset, together with a new split-search strategy, substantially improves homology search over existing state-of-the-art techniques. It also yields more accurate and more sensitive multiple sequence alignments (MSAs) than the manually curated MSAs from Rfam for the majority of structured RNAs mapped to Rfam. The results indicate that MARS, coupled with the fully automatic homology search tool RNAcmap, will be useful for improved structural and functional inference of ncRNAs and for RNA language models based on MSAs. MARS is accessible at https://ngdc.cncb.ac.cn/omix/release/OMIX003037, and RNAcmap3 is accessible at http://zhouyq-lab.szbl.ac.cn/download/.