Funding: Project supported by the National Natural Science Foundation of China (Nos. 62025208 and 62421002).
Abstract: Transformer-based models such as large language models (LLMs) have attracted significant attention in recent years due to their superior performance. A long sequence of input tokens is essential for industrial LLMs to provide better user services. However, memory consumption grows quadratically with sequence length, which makes it hard to scale up long-sequence training. Current parallelism methods produce duplicated tensors during execution, leaving room to improve memory efficiency. Additionally, tensor parallelism (TP) cannot achieve effective overlap between computation and communication. To address these weaknesses, we propose a general parallelism method called memory-efficient tensor parallelism (METP), designed for the computation of two consecutive matrix multiplications with an optional function between them (O = f(AB)C), which is the core computation in Transformer training. METP distributes the subtasks of computing O across multiple devices and uses send/recv instead of collective communication to exchange the submatrices needed to finish the computation, which avoids producing duplicated tensors. We also apply double buffering to achieve better overlap between computation and communication. We present the theoretical condition for full overlap to guide the long-sequence training of Transformers. Suppose the parallel degree is p; through theoretical analysis, we prove that METP provides O(1/p^3) memory overhead when FlashAttention is not used to compute attention, and could save at least 41.7% memory compared to TP when FlashAttention is used to compute multi-head self-attention. Our experimental results demonstrate that METP can increase the sequence length by 2.38–2.99 times compared to other methods when using eight A100 graphics processing units (GPUs).
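The decomposition behind O = f(AB)C can be illustrated with a small single-process sketch. It is not the authors' METP schedule: it only simulates p "devices" in a loop, assumes f is applied elementwise (a softmax would additionally need a running normalization, as in FlashAttention, which is not shown), and all function names (`ring_fabc`, `row_blocks`, and so on) are hypothetical. Each simulated device keeps one row block of A and sees one (B_j, C_j) pair per step, so only one tile of f(AB) exists at a time instead of the full matrix.

```python
# Illustrative sketch only: p "devices" are simulated in one process.
# Assumption: f is elementwise, so f(AB) can be evaluated tile by tile.

def matmul(X, Y):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def apply(f, X):
    """Apply f to every entry of X."""
    return [[f(v) for v in row] for row in X]

def add_inplace(X, Y):
    """Accumulate Y into X entry by entry."""
    for rx, ry in zip(X, Y):
        for k, v in enumerate(ry):
            rx[k] += v

def row_blocks(X, p):
    """Split the rows of X into p contiguous blocks (row count must be divisible by p)."""
    size = len(X) // p
    return [X[i * size:(i + 1) * size] for i in range(p)]

def col_blocks(X, p):
    """Split the columns of X into p contiguous blocks (column count must be divisible by p)."""
    size = len(X[0]) // p
    return [[row[j * size:(j + 1) * size] for row in X] for j in range(p)]

def ring_fabc(A, B, C, f, p):
    """Compute O = f(A B) C blockwise using O_i = sum_j f(A_i B_j) C_j."""
    A_blk = row_blocks(A, p)   # device i keeps A_i for the whole run
    B_blk = col_blocks(B, p)   # (B_j, C_j) pairs are what would travel between devices
    C_blk = row_blocks(C, p)
    n_out = len(C[0])
    O_blk = [[[0.0] * n_out for _ in A_blk[i]] for i in range(p)]
    for step in range(p):
        for i in range(p):
            j = (i + step) % p                         # pair visible to device i at this step
            S = apply(f, matmul(A_blk[i], B_blk[j]))   # one tile of f(AB), never the full matrix
            add_inplace(O_blk[i], matmul(S, C_blk[j]))
    return [row for blk in O_blk for row in blk]
```

In a real multi-device setting the inner loop over `i` would run concurrently, and the index shift `(i + step) % p` would correspond to point-to-point transfers of the (B_j, C_j) pairs between neighbouring devices.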
Funding: National Key Research and Development Program of China, Grant/Award Number: 2023YFC2811502; National Natural Science Foundation of China, Grant/Award Numbers: 62272300, 62473257.
Abstract: Long noncoding RNAs (lncRNAs) are crucial in gene regulation, chromatin architecture, and cellular differentiation, playing significant roles in various diseases and serving as potential biomarkers and therapeutic targets. Understanding their precise subcellular localization is essential for elucidating their functions in biological pathways. Current methods for predicting lncRNA subcellular localization face challenges in capturing long-range interactions within sequences. Deep learning models often struggle with feature extraction that adequately represents these distant dependencies, leading to limited predictive accuracy. We develop Loc4Lnc, a deep learning framework for predicting lncRNA subcellular localization. The model integrates convolutional layers and transformer blocks to capture both local sequence motifs and long-range dependencies within RNA sequences, followed by classification using TextCNN. Using the RNALocate v2.0 database, we constructed a benchmark dataset covering five subcellular locations (cytoplasm, nucleus, cytosol, chromatin, and exosome). The performance of the model is evaluated against existing feature extraction methods and existing predictors. The results demonstrate significant improvements in predicting lncRNA subcellular localization. The model achieved a prediction accuracy of 0.636 on an independent test set, outperforming existing methodologies. Comparative evaluations showed that it consistently surpassed traditional feature extraction methods and state-of-the-art predictors, highlighting its robustness and effectiveness in classifying lncRNAs across five distinct subcellular locations. Loc4Lnc captures long-range interactions and optimizes information flow between distal elements, providing an effective predictive tool for the subcellular localization of lncRNAs and laying the foundation for future research on the regulation of gene expression and cellular functions by lncRNAs.
Funding: Supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2021R1C1C1007980 to B.J.K.); the Chungnam National University Sejong Hospital Research Fund, 2022, and Chungnam National University (to B.J.K.); the Basic Science Research Program through the NRF, funded by the Ministry of Education (No. 2021R1A2C2092038 to B.Y.C.); the Bio Core Facility Center program (No. NRF-2022M3A9G1014007 to B.Y.C.); the Basic Research Laboratory program through the NRF, funded by the Ministry of Education (No. RS-2023-0021971031482092640001 to B.Y.C.); the Technology Innovation Program (No. K_G012002572001 to B.Y.C.), funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea); the SNUBH (Seoul National University Bundang Hospital) intramural research fund (No. 13-2022-0010, 02-2017-0060, 16-2023-0002, 13-2023-0002, 16-2022-0005, 13-2024-0004, and 13-2017-0013 to B.Y.C.); and the National Institute on Deafness and Other Communication Disorders (NIDCD), part of the US National Institutes of Health (No. R01DC018814 to S.P.).
Abstract: Otoancorin (OTOA) is a glycosylphosphatidylinositol (GPI)-anchored protein mediating the attachment of the tectorial membrane (TM) to the spiral limbus (SL) in the inner ear. Homozygous or compound heterozygous mutations in OTOA cause autosomal recessive deafness (DFNB22). We performed short-read exome sequencing (SRS) in a 10-month-old boy with sensorineural hearing loss, identifying a potential p.Glu787* variant in OTOA. Interestingly, this variant is common among normal-hearing individuals, leading us to question its pathogenic potential.
Abstract: Diffusion tensor imaging plays an important role in the accurate diagnosis and prognosis of spinal cord diseases. However, because of technical limitations, the imaging sequences used in this technique cannot reveal the fine structure of the spinal cord with precision. We used the readout segmentation of long variable echo-trains (RESOLVE) sequence in this cross-sectional study of 45 healthy volunteers aged 20 to 63 years. We found that the RESOLVE sequence significantly increased the resolution of the diffusion images and improved the median signal-to-noise ratio of the middle (C4–6) and lower (C7–T1) cervical segments to the level of the upper cervical segment. In addition, the values of fractional anisotropy and radial diffusivity were significantly higher in white matter than in gray matter. Our study verified that the RESOLVE sequence could improve the resolution of diffusion tensor imaging in clinical applications and provide accurate baseline data for the diagnosis and treatment of cervical spinal cord diseases.
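For readers unfamiliar with the two metrics named in this abstract, the standard textbook definitions are given below. They are not taken from the study itself; the symbols λ1 ≥ λ2 ≥ λ3 denote the eigenvalues of the diffusion tensor and are introduced here only for illustration.

```latex
% Standard definitions of fractional anisotropy (FA) and radial diffusivity (RD)
% in terms of the diffusion tensor eigenvalues \lambda_1 \ge \lambda_2 \ge \lambda_3.
\[
\mathrm{FA} = \sqrt{\frac{(\lambda_1-\lambda_2)^2 + (\lambda_2-\lambda_3)^2 + (\lambda_3-\lambda_1)^2}
                         {2\,(\lambda_1^2 + \lambda_2^2 + \lambda_3^2)}},
\qquad
\mathrm{RD} = \frac{\lambda_2 + \lambda_3}{2}.
\]
```

FA ranges from 0 (isotropic diffusion) to 1 (diffusion along a single axis), while RD measures the mean diffusivity perpendicular to the principal direction.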
Abstract: The rapid advancement of nanopore metagenomic sequencing has revolutionized microbiome research, enabling significant breakthroughs in the study of microbial communities, ecological dynamics, and genome-level functions [1–5]. This sequencing technology stands out for its ability to generate long sequencing reads, which are pivotal in resolving complex microbial genomic structures that short-read sequencing often fails to address [2,6].
Abstract: The journal Genomics, Proteomics & Bioinformatics (GPB) is inviting submissions for a special issue (to be published in the Spring of 2026) on the topic of “Long-read Sequencing”. Long-read sequencing (LRS) technologies are revolutionizing the field of genomics by providing unprecedented insights into genome architecture and function.
Funding: Supported by the National Natural Science Foundation of China (31301958) and the Chinese Postdoctoral Science Foundation (2013T60808).
Abstract: High-throughput sequencing has identified a large number of sense-antisense transcriptional pairs, which indicates that these genes are transcribed from both directions. Recent reports have demonstrated that many antisense RNAs, especially long non-coding RNAs (lncRNAs), can interact with the sense RNA by forming an RNA duplex. Many methods, such as RNA sequencing, Northern blotting, RNase protection assays and strand-specific PCR, can be used to detect the antisense transcript and the orientation of gene transcription. However, the application of these methods has been constrained, to some extent, by high cost, difficult operation or inaccuracy, especially when analyzing substantial amounts of data. Thus, we developed an easy method to detect and validate these complicated RNAs. We primarily took advantage of the strand specificity of RT-PCR and the single-strand specificity of S1 endonuclease to analyze sense and antisense transcripts. Four known genes, mouse β-actin and Tsix (Xist antisense RNA), and chicken LXN (latexin) and GFM1 (G elongation factor, mitochondrial 1), were used to establish the method. These four genes are well studied and are transcribed from the positive strand, the negative strand or both strands of DNA, which together represent all possible cases. The results indicated that the method can easily distinguish sense, antisense and sense-antisense transcriptional pairs. In addition, it can be used to verify the results of high-throughput sequencing, as well as to analyze the regulatory mechanisms between RNAs. The method can improve the accuracy of detection, is low-cost, and is mainly suited to analyzing single genes.
Funding: Supported by grants from the National Natural Science Foundation of China (No. 32130036, 82403148, 82303937, 82073206); the Shanghai Shenkang Clinical Technology Innovation Project (No. SHDC12021101); the Basic Research Project of the Science and Technology Commission of Shanghai Municipality (No. 20JC1419100); the National Key Research and Development Program of China (No. 2021YFE0203300); the Science and Technology Innovation Action Plan Technical Standards Project of the Science and Technology Commission of Shanghai Municipality (23DZ2202800); the Cooperative Research Projects of Shanghai Jiao Tong University (2022LHA13); the Major Science and Technology R&D Project of the Science and Technology Department of Jiangxi Province (20213AAG01013); Shanghai Outstanding Academic Leader (23XD1450700); the Shanghai Rising-Star Program (23QA1408500); the Young Talents Project of the Shanghai Municipal Health Commission (2022YQ061); the Shanghai Municipal Health Commission health industry clinical research special project (No. 20224Z0014); and the Shuguang Program of the Shanghai Education Development Foundation and Shanghai Municipal Education Commission (No. 20SG14).
Abstract: Aberrant RNA alternative splicing in cancer generates varied novel isoforms and protein variants that facilitate cancer progression. Here, we employed long-read full-length transcriptome sequencing on normal gallbladder tissues, tumors, and cell lines to establish a comprehensive full-length gallbladder transcriptomic atlas. Notably, receptor tyrosine kinases were among the most dynamic components, with highly variable transcripts, and Erb-B2 receptor tyrosine kinase 2 (ERBB2) was a prime representative. A novel transcript, designated ERBB2 i14e, was found to encode a novel functional protein; its protein expression was elevated in gallbladder cancer and strongly associated with worse prognosis. Under the regulation of the splicing factors ESRP1/2, ERBB2 i14e was alternatively spliced from intron 14, and the encoded i14e peptide was shown to facilitate the interaction with ERBB3 and downstream activation of AKT signaling. ERBB2 i14e was inducible, and its expression attenuated the efficacy of anti-ERBB2 treatment in tumor xenografts. Further studies with patient-derived xenograft models validated that blocking ERBB2 i14e with an antisense oligonucleotide enhanced tumor sensitivity to trastuzumab and its drug conjugates. Overall, this study provides a gallbladder-specific long-read transcriptome profile and uncovers a novel mechanism of trastuzumab resistance, thereby informing strategies to improve trastuzumab therapy.
Funding: This work was supported by the National Natural Science Foundation of China (Grant No. 31861143026 to C.Y.), the Ministry of Science and Technology of China (Grant Nos. 2019YFA0110902 and 2019YFA08002501 to C.Y.), the Ludwig Institute for Cancer Research (C-X.S.), Cancer Research UK (C63763/A26394 and C63763/A27122 to C-X.S.), the NIHR Oxford Biomedical Research Centre (to C-X.S.) and Emerson Collective (to C-X.S.). L-Y.Z. is supported by the China Scholarship Council. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health. We apologize for not being able to cite all the publications related to this topic due to space constraints of the journal.
Abstract: Over 17 and 160 types of chemical modifications have been identified in DNA and RNA, respectively. The interest in understanding the various biological functions of DNA and RNA modifications has led to the cutting-edge fields of epigenomics and epitranscriptomics. Developing chemical and biological tools to detect specific modifications in the genome or transcriptome has greatly facilitated their study. Here, we review the recent technological advances in this rapidly evolving field. We focus on high-throughput detection methods and biological findings for these modifications, and discuss questions that remain to be addressed. We also summarize third-generation sequencing methods, which enable long-read and single-molecule sequencing of DNA and RNA modifications.
Funding: Supported by the Zhejiang Lab, China (No. 2020KB0AC02); the Zhejiang Provincial Key R&D Program, China (Nos. 2022C01022, 2022C01119, and 2021C03003); the National Natural Science Foundation of China (Nos. T2293723 and 61972347); the Zhejiang Provincial Natural Science Foundation, China (No. LR19F020005); and the Fundamental Research Funds for the Central Universities, China (No. 226-2022-00051).
Abstract: The deformability and high degree of freedom of mollusks bring challenges in mathematical modeling and synthesis of motions. Traditional analytical and statistical models are limited by either rigid skeleton assumptions or model capacity, and have difficulty generating realistic and multi-pattern mollusk motions. In this work, we present a large-scale dynamic pose dataset of Drosophila larvae and propose a motion synthesis model named Path2Pose to generate a pose sequence given the initial poses and the subsequent guiding path. The Path2Pose model is further used to synthesize long pose sequences of various motion patterns through a recursive generation method. Evaluation results demonstrate that our model synthesizes highly realistic mollusk motions and achieves state-of-the-art performance. Our work demonstrates the high performance of deep neural networks for mollusk motion synthesis and the feasibility of long pose sequence synthesis based on a customized body shape and guiding path.
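The recursive generation idea mentioned in this abstract, producing a long sequence by repeatedly feeding the most recent poses and the next stretch of the guiding path back into a short-horizon generator, can be sketched as follows. This is only a schematic under stated assumptions: `toy_chunk_model` is a hypothetical stand-in that rigidly shifts the last pose along the path and has nothing to do with the actual Path2Pose network, and the parameter names `context_len` and `chunk_len` are illustrative.

```python
# Schematic of recursive long-sequence generation.
# A pose is a list of (x, y) keypoints; a path is a list of (x, y) points.

def toy_chunk_model(context, path_chunk):
    """Hypothetical stand-in for a learned generator: shifts the last pose rigidly
    so that its first keypoint lands on each guiding-path point in turn."""
    poses = []
    last = context[-1]
    for px, py in path_chunk:
        dx, dy = px - last[0][0], py - last[0][1]
        last = [(x + dx, y + dy) for x, y in last]
        poses.append(last)
    return poses

def synthesize_long(initial_poses, path, model, context_len, chunk_len):
    """Generate one pose per path point by calling `model` on successive path chunks,
    each time conditioning on the most recently generated poses."""
    sequence = list(initial_poses)
    for start in range(0, len(path), chunk_len):
        chunk = path[start:start + chunk_len]   # next stretch of the guiding path
        context = sequence[-context_len:]       # most recent poses act as the new "initial poses"
        sequence.extend(model(context, chunk))
    return sequence
```

Any callable with the same `(context, path_chunk)` interface could replace the toy model; the loop itself does not depend on how each chunk is generated.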