This work proposes a Tensor Train Random Projection (TTRP) method for dimension reduction, where pairwise distances can be approximately preserved. Our TTRP is systematically constructed through a Tensor Train (TT) representation with TT-ranks equal to one. Based on the tensor train format, this random projection method can speed up the dimension reduction procedure for high-dimensional datasets and incurs lower storage costs with little loss in accuracy, compared with existing methods. We provide a theoretical analysis of the bias and the variance of TTRP, which shows that this approach is an expected isometric projection with bounded variance, and we show that the scaling Rademacher variable is an optimal choice for generating the corresponding TT-cores. Detailed numerical experiments with synthetic datasets and the MNIST dataset are conducted to demonstrate the efficiency of TTRP.
Funding: supported by the National Natural Science Foundation of China (No. 12071291), the Science and Technology Commission of Shanghai Municipality (No. 20JC1414300), and the Natural Science Foundation of Shanghai (No. 20ZR1436200).
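As a rough illustration of the rank-one construction, the sketch below projects a tensorized vector by contracting each mode with an independent Rademacher matrix and applying a 1/sqrt(M) scaling, M being the product of the output mode sizes; the function name `ttrp_sketch` and the per-mode loop are our own illustration under these assumptions, not the authors' implementation.

    import numpy as np

    def ttrp_sketch(x, in_dims, out_dims, seed=0):
        # Illustrative TT-rank-one random projection: one Rademacher (+/-1)
        # matrix per tensor mode, then 1/sqrt(M) with M = prod(out_dims).
        rng = np.random.default_rng(seed)
        t = np.asarray(x).reshape(in_dims)               # tensorize the input
        for i, (m, n) in enumerate(zip(out_dims, in_dims)):
            core = rng.choice([-1.0, 1.0], size=(m, n))  # Rademacher core
            t = np.moveaxis(np.tensordot(core, t, axes=([1], [i])), 0, i)
        return t.reshape(-1) / np.sqrt(np.prod(out_dims))

With this scaling, E||y||^2 = ||x||^2 for the projected vector y, consistent with the expected-isometry property stated above.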
In this article, two new algorithms are presented that convert a given data tensor train into either a Tucker decomposition with orthogonal matrix factors or a multi-scale entanglement renormalization ansatz (MERA). The Tucker core tensor is never explicitly computed but stored as a tensor train instead, resulting in algorithms that are both computationally and storage efficient. Both the multilinear Tucker ranks and the MERA ranks are determined automatically by the algorithms for a given upper bound on the relative approximation error. In addition, an iterative algorithm with low computational complexity, based on solving an orthogonal Procrustes problem, is proposed for the first time to retrieve optimal rank-lowering disentangler tensors, which are a crucial component in the construction of a low-rank MERA. Numerical experiments demonstrate the effectiveness of the proposed algorithms together with the potential storage benefit of a low-rank MERA over a tensor train.
Funding: supported by the Ministry of Education and Science of the Russian Federation (grant 14.756.31.0001).
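The disentangler update rests on the classical orthogonal Procrustes problem. As a generic building-block sketch (not the paper's full MERA algorithm), the orthogonal matrix minimizing ||Q A - B||_F is recovered from a single SVD:

    import numpy as np

    def procrustes(A, B):
        # min_Q ||Q @ A - B||_F over orthogonal Q: writing B @ A.T = U S V^T,
        # the trace term trace(Q.T @ B @ A.T) is maximized by Q = U @ V^T.
        U, _, Vt = np.linalg.svd(B @ A.T)
        return U @ Vt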
As the parameter scale of large language models (LLMs) continues to grow, fine-tuning models with tens of billions of parameters places extreme demands on compute and storage resources. Conventional distributed training schemes typically rely on large numbers of high-end GPUs and high-speed interconnects, making training prohibitively expensive. Existing single-GPU training schemes relieve GPU memory pressure through tensor offloading, but still suffer from low I/O transfer efficiency and poor device utilization. Conventional kernel-space I/O introduces frequent system calls and context switches during large-scale tensor migration, becoming a key performance bottleneck; meanwhile, optimizer computation fails to exploit the parallelism of multi-core CPUs and is difficult to overlap effectively with GPU computation, further limiting system performance. To address these problems, this paper proposes HiTrain, a heterogeneous memory offloading and I/O optimization scheme for large-model training. First, a high-performance tensor storage module based on the Storage Performance Development Kit (SPDK) is built; by managing tensor data in user space, it avoids the overhead of the kernel I/O stack and improves the concurrency and throughput of tensor offloading. Second, a storage-compute pipeline scheduling module based on an asynchronous optimizer is designed and implemented; it reorders optimizer execution to reduce GPU wait time and improve overall training efficiency. Experimental results show that, on a server equipped with a single GPU and an NVMe (non-volatile memory express) SSD, the proposed scheme fully utilizes the system's compute and storage resources, improving tensor offload and load efficiency during training by 32.7% and raising overall training throughput to 1.49x that of existing schemes, offering a practical path to low-cost large-model training.
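The pipeline idea, keeping the GPU busy while tensors are written out in the background, can be sketched with ordinary Python threads and file I/O. HiTrain itself performs the I/O in user space via SPDK, which is not reproduced here; the function and path names below are purely illustrative.

    import os
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def offloaded_steps(tensors, compute_step, out_dir="/tmp/offload"):
        # Double-buffered offload: while compute_step runs for step i, the
        # previous step's result is being written to disk on a worker thread.
        os.makedirs(out_dir, exist_ok=True)
        io_pool = ThreadPoolExecutor(max_workers=1)
        pending = None
        for i, t in enumerate(tensors):
            result = compute_step(t)          # stand-in for the GPU work
            if pending is not None:
                pending.result()              # wait for the previous write
            path = os.path.join(out_dir, f"step_{i}.npy")
            pending = io_pool.submit(np.save, path, result)
        if pending is not None:
            pending.result()
        io_pool.shutdown()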
Recent advances in language modeling show that large pre-trained Transformer models achieve excellent performance in natural language processing applications. However, limited GPU memory makes training large language models (LLMs) challenging: standard tensor parallelism requires each GPU to store all activations and therefore cannot break through the memory bottleneck. To relieve the GPU-memory constraint on LLM training and improve training efficiency, this paper proposes a two-dimensional tensor parallelism method (2D tensor parallelism, TP2D). TP2D partitions the input data and the parameter matrices and distributes the blocks across four GPUs; distributed communication provides high-speed data exchange between GPUs, realizing genuinely distributed parallel training. Using GPT-2 as the baseline model, the soft-scaling efficiency and training efficiency of the two methods are evaluated. Experiments show that with four GPUs, TP2D trains 1.84x faster than standard tensor parallelism, achieves a soft-scaling efficiency of 86%, and reduces memory usage.
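The 2x2 layout can be mimicked serially in NumPy: each simulated GPU (i, j) owns one block of the input and one block of the weights, and output block Y[i][j] accumulates the contributions that, in TP2D, would arrive over row and column broadcasts. This simulates only the data layout, not TP2D's actual communication schedule.

    import numpy as np

    def tp2d_matmul_sim(X, W):
        # Split X (rows x inner) and W (inner x cols) into 2x2 blocks, one per
        # simulated GPU, and form Y[i][j] = sum_k X[i][k] @ W[k][j].
        Xb = [np.hsplit(r, 2) for r in np.vsplit(X, 2)]
        Wb = [np.hsplit(r, 2) for r in np.vsplit(W, 2)]
        Yb = [[sum(Xb[i][k] @ Wb[k][j] for k in range(2)) for j in range(2)]
              for i in range(2)]
        return np.block(Yb)

    # Sanity check against the undistributed product:
    X, W = np.random.randn(8, 6), np.random.randn(6, 4)
    assert np.allclose(tp2d_matmul_sim(X, W), X @ W)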
We introduce a new tensor integration method for time-dependent partial differential equations (PDEs) that controls the tensor rank of the PDE solution via time-dependent smooth coordinate transformations. Such coordinate transformations are obtained by solving a sequence of convex optimization problems that minimize the component of the PDE operator responsible for increasing the tensor rank of the PDE solution. The new algorithm improves upon the non-convex algorithm we recently proposed in Dektor and Venturi (2023), which has no guarantee of producing globally optimal rank-reducing coordinate transformations. Numerical applications demonstrating the effectiveness of the new coordinate-adaptive tensor integration method are presented and discussed for prototype Liouville and Fokker-Planck equations.
Funding: supported by the U.S. Air Force Office of Scientific Research (AFOSR) grant FA9550-20-1-0174 and the U.S. Army Research Office (ARO) grant W911NF-18-1-0309.
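A minimal example of the mechanism being exploited: a function that is far from separable in the original coordinates can become exactly rank one after a rotation. The grid size and tolerance below are arbitrary illustration choices, not values from the paper.

    import numpy as np

    n = 256
    g = np.linspace(-4.0, 4.0, n)
    X, Y = np.meshgrid(g, g, indexing="ij")

    def num_rank(A, tol=1e-10):
        s = np.linalg.svd(A, compute_uv=False)
        return int(np.sum(s > tol * s[0]))

    F = np.exp(-(X + Y) ** 2)        # f(x, y) sampled on a tensor grid in (x, y)
    # In rotated coordinates u = (x + y)/sqrt(2), v = (y - x)/sqrt(2), the same
    # function depends on u alone, so its samples on a (u, v) grid are rank one:
    G = np.exp(-2.0 * g**2)[:, None] * np.ones(n)[None, :]
    print(num_rank(F), num_rank(G))  # large numerical rank vs. exactly 1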