Funding: Supported by the National Key R&D Program of China (Grant Nos. 2020YFA0803200 and 2023YFC2505903), the National Natural Science Foundation of China (Grant Nos. 82003014, 31930026, 81972876, 82150112, 92168116, 81725014, 81822035, and 82222052), the China Postdoctoral Science Foundation (Grant No. 2020M671231), and the Fundamental Research Funds for the Central Universities (Grant No. 22120240327).
Abstract: Gastric cancer (GC) is a major cause of cancer-related mortality worldwide. GC is determined by multiple (epi)genetic and environmental factors; can occur at distinct anatomic positions of the stomach; and displays high heterogeneity, with different cellular origins and diverse histological and molecular features. This heterogeneity has hindered efforts to fully understand the pathology of GC and develop efficient therapeutics. In the past decade, great progress has been made in the study of GC, particularly in molecular subtyping, investigation of the immune microenvironment, and defining the evolutionary path and dynamics. Preclinical mouse models, particularly immunocompetent models that mimic the cellular and molecular features of human GC, in combination with organoid culture and clinical studies, have provided powerful tools for elucidating the molecular and cellular mechanisms underlying GC pathology and immune evasion, and for developing novel therapeutic strategies. Herein, we first briefly introduce current progress and challenges in GC study and subsequently summarize immunocompetent GC mouse models, emphasizing the potential application of genetically engineered mouse models in antitumor immunity and immunotherapy studies.
Funding: Funded by the National Key Research and Development Program of China (Grant Nos. 2023ZD0120604 and 2024YFB4504103) and the Major Science and Technology Special Projects in Henan Province (241111212300).
Abstract: Convolution algorithms based on the Winograd implementation can reduce computational complexity and are widely used in CNNs. As an emerging GPU-like accelerator, the DCU has seen some performance optimization for the Winograd algorithm, but existing implementations fail to fully exploit the DCU's Matrix Cores to further enhance the efficiency of Winograd convolution computations. This paper proposes an improved fused Winograd convolution optimization scheme that integrates all transformation stages into a single kernel, specifically designed to exploit the characteristics of Matrix Cores. In the input transformation stage, we design an efficient data reuse mechanism that reduces redundant global memory accesses. In the element-wise matrix multiplication stage, we transform Hadamard products into batched GEMMs, boosting computational intensity and complying with the data layout requirements of Matrix Cores. During kernel fusion, we eliminate shared memory bank conflicts by reorganizing the thread layout and further introduce software pipelining to effectively mask memory access latency. The results show that our method achieves average speedups of 1.35× and 1.72× (up to 1.81× and 2.78×) over the Winograd and Implicit GEMM algorithms in MIOpen under FP16 mode, and 1.22× and 1.53× (up to 1.55× and 1.88×) under FP32 mode.
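The two transformations named above can be illustrated with a minimal NumPy sketch: the 1-D Winograd F(2,3) case using the standard Lavin–Gray matrices, followed by the Hadamard-to-batched-GEMM rewrite with hypothetical tile/channel sizes (the paper's actual kernel, data layouts, and Matrix Core intrinsics are not reproduced here).

```python
import numpy as np

# Standard Winograd F(2,3) matrices (Lavin & Gray): two outputs of a
# 3-tap correlation computed with 4 multiplies instead of 6.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23(d, g):
    """Correlate a length-4 input tile d with a 3-tap filter g."""
    U = G @ g          # filter transform
    V = BT @ d         # input transform
    M = U * V          # element-wise (Hadamard) product
    return AT @ M      # output transform -> 2 outputs

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 1.0, -1.0])
y = winograd_f23(d, g)
# Reference: direct valid correlation.
ref = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(y, ref)

# With C input channels reduced per output, the independent Hadamard
# products at each of the T transformed tile positions become one
# batched GEMM: M[t] = U[t] @ V[t] over (K, C) x (C, P) for every t.
T, K, C, P = 16, 8, 4, 32            # hypothetical sizes
U_b = np.random.rand(T, K, C)        # transformed filters
V_b = np.random.rand(T, C, P)        # transformed input tiles
M_b = np.matmul(U_b, V_b)            # single batched GEMM, (T, K, P)
assert M_b.shape == (T, K, P)
```

Recasting the Hadamard stage as batched GEMMs is what lets the element-wise multiplications run on matrix-multiply hardware such as the DCU's Matrix Cores, rather than on scalar/vector units.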
Funding: Funded by the Science and Technology Innovation 2030 Program (2023ZD0120604). H. Hua and L. Zhang were supported by the Basic Research Projects of the Key Scientific Research Projects Plan in Henan Higher Education Institutions (25ZX013) and the Scientific Research Team Plan of Zhengzhou University of Aeronautics (23ZHTD01003).
Abstract: Standard convolution remains a major performance bottleneck in modern deep neural networks. Although existing optimization libraries demonstrate effectiveness, they often underutilize key architectural features of emerging accelerators like DCUs, leading to suboptimal performance. To address this limitation, we propose a holistic, architecture-aware framework that systematically co-optimizes the memory hierarchy and computational pipelines. The framework dynamically adapts to convolution parameters for maximal hardware utilization, with core contributions including: an innovative memory management strategy mitigating access conflicts, an adaptive computation pipeline balancing parallelism and data reuse, and a method bypassing API limitations to leverage underlying hardware instructions. On DCU hardware, our framework achieves significant speedups over MIOpen, delivering 3.09× and 1.64× average acceleration for FP16 and FP32 precision respectively, while reducing end-to-end training time for ResNet and EfficientNet by 6.7% and 12.1%.
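For context on the baseline being accelerated: convolution libraries such as MIOpen lower standard convolution to matrix multiplication. A minimal explicit im2col sketch of that lowering (small hypothetical tensor sizes; production kernels form the columns implicitly and tile them through the memory hierarchy, which is what the framework above optimizes) looks like:

```python
import numpy as np

def im2col_conv2d(x, w):
    """Lower a stride-1, no-padding 2-D convolution to a single GEMM.
    x: (C, H, W) input; w: (K, C, R, S) filters -> (K, H-R+1, W-S+1)."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    Ho, Wo = H - R + 1, W - S + 1
    # Gather every (C, R, S) receptive field into one column.
    cols = np.empty((C * R * S, Ho * Wo))
    for i in range(Ho):
        for j in range(Wo):
            cols[:, i * Wo + j] = x[:, i:i + R, j:j + S].ravel()
    # One GEMM: (K, C*R*S) x (C*R*S, Ho*Wo).
    out = w.reshape(K, -1) @ cols
    return out.reshape(K, Ho, Wo)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
w = rng.standard_normal((4, 3, 3, 3))
y = im2col_conv2d(x, w)
assert y.shape == (4, 6, 6)
# Cross-check one output element against direct correlation.
ref = (x[:, 0:3, 0:3] * w[0]).sum()
assert np.isclose(y[0, 0, 0], ref)
```

The explicit `cols` buffer here inflates memory traffic by a factor of roughly R×S; avoiding that materialization while keeping GEMM-level compute intensity is the central tension that memory-hierarchy and pipeline co-optimization addresses.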
Funding: Supported by the Federal Ministry of Research, Technology and Space under funding code 16IS22092 ("KI-Servicezentrum Berlin-Brandenburg").
Abstract: Optimizing GEneral Matrix Multiplication (GEMM) on GPU platforms is becoming increasingly critical to meet the growing computational demands of modern deep neural network research. While significant progress has been made in accelerating high-precision GEMM, the optimization of low-bit GEMM remains a challenging open problem. The CUTLASS library provides highly optimized low-bit GEMM templates leveraging Tensor Cores; however, performance varies considerably depending on tile and pipeline configurations across different GPU architectures. In this work, we propose a novel auto-tuning framework for low-bit CUTLASS GEMM, utilizing a neural network model to predict optimal GEMM template parameters for target GPUs. Our model is trained on a synthetic dataset with up to 116,100 unique samples, encompassing diverse matrix sizes across various Ampere GPUs, and is thoroughly evaluated on these hardware platforms. Experimental results show that our method achieves an accuracy of up to 95.11% on the validation dataset. Furthermore, real-time evaluations of low-bit data types on the A100 GPU demonstrate speedups of up to 1.99× for GEMM operations and 1.28× for the linear layer, compared to the default CUTLASS templates.
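The selection problem such an auto-tuner solves can be sketched as follows. The candidate list and the scoring function here are hypothetical stand-ins: a toy heuristic that penalizes tile-quantization waste takes the place of the paper's trained neural predictor, and the tuples only mimic the shape of CUTLASS-style (tile, pipeline-stage) template parameters.

```python
import numpy as np

# Hypothetical CUTLASS-style template parameters:
# (tile_m, tile_n, tile_k, pipeline_stages).
CANDIDATES = [
    (128, 128, 64, 3),
    (128, 64, 64, 4),
    (64, 128, 64, 4),
    (64, 64, 64, 5),
]

def predict_scores(m, n, k, candidates):
    """Stand-in for a learned predictor: a toy heuristic that
    prefers tiles dividing the problem evenly (less quantization
    waste) and, as a tie-breaker, deeper pipelines."""
    scores = []
    for tm, tn, tk, stages in candidates:
        waste = (-m % tm) / tm + (-n % tn) / tn + (-k % tk) / tk
        scores.append(-waste + 0.01 * stages)
    return np.array(scores)

def select_config(m, n, k):
    """Pick the highest-scoring template for one GEMM problem size."""
    scores = predict_scores(m, n, k, CANDIDATES)
    return CANDIDATES[int(np.argmax(scores))]

# All candidate tiles divide 4096x4096x1024 evenly, so the
# stage tie-breaker selects the deepest pipeline.
cfg = select_config(4096, 4096, 1024)
assert cfg == (64, 64, 64, 5)
```

The practical payoff of prediction over exhaustive search is that the per-problem-size choice becomes a single cheap inference instead of compiling and benchmarking every template instantiation on the target GPU.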