As the complexity of deep learning (DL) networks and training data grows enormously, methods that scale with computation are becoming the future of artificial intelligence (AI) development. In this regard, the interplay between machine learning (ML) and high-performance computing (HPC) is an innovative paradigm to speed up AI research and development. However, building and operating an HPC/AI converged system requires broad knowledge to leverage the latest computing, networking, and storage technologies. Moreover, an HPC-based AI computing environment needs an appropriate resource allocation and monitoring strategy to efficiently utilize the system resources. We therefore introduce a technique for building and operating a high-performance AI computing environment with the latest technologies. Specifically, an HPC/AI converged system is configured inside the Gwangju Institute of Science and Technology (GIST), called the GIST AI-X computing cluster, which is built by leveraging the latest NVIDIA DGX servers, high-performance storage and networking devices, and various open-source tools. It can thus serve as a good reference for building a small or middle-sized HPC/AI converged system for research and educational institutes. In addition, we propose a resource allocation method for DL jobs that efficiently utilizes the computing resources with multi-agent deep reinforcement learning (mDRL). Through extensive simulations and experiments, we validate that the proposed mDRL algorithm helps the HPC/AI converged cluster improve both system utilization and power consumption. By deploying the proposed resource allocation method to the system, total job completion time is reduced by around 20% and inefficient power consumption is reduced by around 40%.
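The abstract does not detail the mDRL scheduler, so as a rough intuition only, the following is a toy multi-agent allocation loop: one agent per compute node learns, via a tabular value function (a stand-in for the deep networks a real mDRL scheduler would use), which job sizes it should bid for, with a reward that favors tight packing. All names, the 8-GPU node size, and the reward shape are illustrative assumptions, not the paper's method.

```python
import random

random.seed(0)

class NodeAgent:
    """One agent per compute node; learns which job sizes to accept.

    A tabular Q function over (free_gpus, job_gpus) stands in for the
    deep policy networks a real mDRL scheduler would train.
    """
    def __init__(self, gpus):
        self.free = gpus
        self.q = {}          # state-action value table
        self.eps = 0.2       # exploration rate

    def bid(self, job_gpus):
        if job_gpus > self.free:
            return None                      # cannot host this job
        key = (self.free, job_gpus)
        if random.random() < self.eps:
            return random.random()           # explore: random bid
        return self.q.get(key, 0.0)          # exploit learned value

    def learn(self, job_gpus, reward, lr=0.5):
        key = (self.free + job_gpus, job_gpus)   # state before placement
        old = self.q.get(key, 0.0)
        self.q[key] = old + lr * (reward - old)

def schedule(agents, job_gpus):
    """Place a job on the highest-bidding agent; reward packing density."""
    bids = [(a.bid(job_gpus), a) for a in agents]
    bids = [(b, a) for b, a in bids if b is not None]
    if not bids:
        return None                          # no node can host the job
    _, winner = max(bids, key=lambda x: x[0])
    winner.free -= job_gpus
    reward = 1.0 - winner.free / 8.0         # fuller node => higher reward
    winner.learn(job_gpus, reward)
    return winner

agents = [NodeAgent(8) for _ in range(4)]
winner = schedule(agents, 4)
```

In the real system each agent would observe richer state (queue lengths, power draw, job history) and the reward would blend utilization with power consumption, per the paper's stated objectives.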
Generative large models, exemplified by ChatGPT, exhibit unprecedented general-purpose capabilities and are driving a shift in the scientific research paradigm toward AI for Science. This trend is accelerating the convergence of high-performance computing and AI computing, and converged intelligent supercomputing systems are becoming the key infrastructure for future large-model development and scientific discovery. This paper systematically analyzes the major opportunities facing intelligent supercomputing systems at this historic juncture and examines in depth the severe challenges they encounter in compute chips, system architecture, hardware systems, software ecosystems, reliability, and energy consumption. It argues that hardware-software co-design and close cooperation across the whole industry chain are needed to lay the foundation for an efficient, inclusive, and sustainable new generation of intelligent supercomputing systems.
The evolution of high-performance computing (HPC) technology has always been closely intertwined with strategic demands in defense, basic science, and industrial engineering. Its history can be roughly divided into four key stages: special-purpose vector machines, massively parallel computers, heterogeneous parallel computers, and converged HPC/AI computers, with continuous evolution in architecture, software ecosystem, and application patterns at each stage. HPC is currently undergoing a profound, AI-driven paradigm shift: "AI for Science" has emerged as a new mode of scientific research, and the high-performance, high-precision character of scientific computing is converging deeply with the high-performance, mixed-precision character of intelligent computing, posing severe challenges to the underlying compute architecture in precision coordination, data exchange, and I/O pattern adaptation. Looking ahead, the focus of competition in converged HPC/AI technology is shifting from peak floating-point performance alone to a combined consideration of data-movement efficiency, energy efficiency, and system scalability. Tighter integration among compute units, more efficient data flow, and more unified programming abstractions will be key features of next-generation HPC systems. The CPU-SIMT fused compute architecture is a promising converged design: its scheme of "fused compute architecture + hierarchical interconnection network + fused parallel storage" has the potential to break through the "communication wall" bottleneck of tightly coupled converged applications, offering a new technical path for building next-generation HPC systems and efficiently supporting the new "AI for Science" computing paradigm.
Artificial Intelligence (AI), particularly in the fields of Machine Learning (ML) and Deep Learning (DL), has become an important tool for the scientific community in general. By coupling traditional modelling and simulation with the new AI approaches in a hybrid fashion, we can advance our current science in leaps and bounds. This is particularly so in the field of Earth System Science (ESS) and, although the technology is still somewhat nascent, the opportunities and potential benefits are enormous. Furthermore, this mixed approach offers a pathway to the extremely demanding future science goals in this space that cannot be reached through computing capability alone. This paper examines the state of the art in the application of HPC+AI to the domain of ESS, identifying several important application areas and techniques for hybrid modelling. We also look at the challenges that currently limit widespread adoption of hybrid modelling and delve into potential solutions to those limitations.
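One common form of the hybrid coupling described above is a learned correction on top of a cheap process model: the physics model supplies the trend, and a data-driven component is fitted to its residuals against reference observations. The sketch below uses a deliberately trivial linear "physics" model and synthetic data (both illustrative assumptions, not from the paper); real ESS hybrids replace both parts with simulations and neural networks.

```python
# Hybrid modelling sketch: physics model + learned residual correction.

def physics_model(x):
    """Stand-in process model: captures the trend but is biased."""
    return x

def fit_linear_correction(xs, ys):
    """Least-squares fit of the residuals y - physics(x) as r = a*x + b."""
    rs = [y - physics_model(x) for x, y in zip(xs, ys)]
    n = len(xs)
    mx = sum(xs) / n
    mr = sum(rs) / n
    a = (sum((x - mx) * (r - mr) for x, r in zip(xs, rs))
         / sum((x - mx) ** 2 for x in xs))
    b = mr - a * mx
    return a, b

def hybrid_model(x, a, b):
    """Physics prediction plus the learned correction."""
    return physics_model(x) + a * x + b

# Synthetic "observations" from a true process y = 1.5x + 0.3 that the
# biased physics model alone cannot reproduce.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.5 * x + 0.3 for x in xs]
a, b = fit_linear_correction(xs, ys)
```

The appeal of this structure is that the physics model keeps the hybrid anchored to known dynamics, while the learned part only has to model what the physics misses, which is typically a much easier target than the full process.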