Funding: supported by the National Natural Science Foundation of China (No. 22373112 to Ji Qi, Nos. 22373111 and 21921004 to Minghui Yang) and GH-fund A (No. 202107011790).
Abstract: In this study, we investigate the efficacy of a hybrid parallel algorithm aimed at accelerating the evaluation of two-electron repulsion integrals (ERIs) and Fock matrix generation on the Hygon C86/DCU (deep computing unit) heterogeneous computing platform. Multiple hybrid parallel schemes are assessed on a range of model systems, including systems with up to 1200 atoms and 10000 basis functions. Our findings reveal that, in Hartree-Fock (HF) calculations, a single DCU achieves a speedup of 33.6 over 32 C86 CPU cores. Compared with the efficiency of the Wuhan Electronic Structure Package on Intel X86 and NVIDIA A100 computing platforms, the Hygon platform exhibits good cost-effectiveness, showing great potential for quantum chemistry calculations and other high-performance scientific computations.
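For orientation, the quantity being assembled here is the standard closed-shell Fock matrix, F_{μν} = H_{μν} + Σ_{λσ} D_{λσ} [(μν|λσ) − ½(μλ|νσ)], built from the core Hamiltonian H, the density matrix D, and the ERIs. The NumPy sketch below is a minimal dense-tensor reference for this contraction, meant only to fix notation; it is not the paper's DCU kernel, and all function and variable names are ours.

```python
import numpy as np

def build_fock(hcore, eri, density):
    """Closed-shell Fock matrix from a precomputed ERI tensor.

    hcore   : (n, n) core Hamiltonian
    eri     : (n, n, n, n) ERIs in chemists' notation, (mu nu | lam sig)
    density : (n, n) density matrix
    """
    coulomb = np.einsum('mnls,ls->mn', eri, density)    # J: (mu nu|lam sig) D_ls
    exchange = np.einsum('mlns,ls->mn', eri, density)   # K: (mu lam|nu sig) D_ls
    return hcore + coulomb - 0.5 * exchange

# Tiny self-check on random symmetric inputs.
n = 4
rng = np.random.default_rng(0)
h = rng.standard_normal((n, n)); h = 0.5 * (h + h.T)
d = rng.standard_normal((n, n)); d = 0.5 * (d + d.T)
g = rng.standard_normal((n, n, n, n))
print(build_fock(h, g, d).shape)  # (4, 4)
```

A production HF code avoids materializing the O(n^4) ERI tensor and instead evaluates and contracts integral batches on the fly; this sketch deliberately ignores such concerns.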
Abstract: The increasing complexity of neural network applications has led to a demand for higher computational parallelism and more efficient synchronization in artificial intelligence (AI) chips. To achieve higher performance at lower power, a comprehensive and efficient approach is required to compile neural networks for implementation on dedicated hardware. Our first-generation deep learning accelerator, the tensor computing unit, was presented with both hardware and software solutions. It offers dedicated very long instruction word (VLIW) instructions and multi-level repeatable direct memory access (DMA). The former lowers the instruction bandwidth requirement and makes it easier to parallelize the index and vector computations; the latter reduces the communication latency between the compute core and the asynchronous DMA, and greatly alleviates programming complexity. For operator implementation and optimization, the compiler-based data-flow generator and the instruction macro generator first produce a set of parameterized operators. The tuner-configuration generator then prunes the search space, and the distributed tuner framework selects the best data-flow pattern and its corresponding parameters. Our tensor computing unit supports all convolution parameters with full-shape dimensions. It can readily select proper operators to achieve 96% of the chip's peak performance for certain shapes and find the best-performing implementation within a limited power budget. Evaluation of a large number of convolution shapes on our tensor computing unit chip shows that the generated operators significantly outperform handwritten ones, achieving 9% higher normalized performance than CUDA according to the silicon data.
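As a rough illustration of the prune-then-tune flow just described (parameterized operators, search-space pruning, then distributed measurement), here is a Python sketch of such a loop for one convolution shape. The tile sizes, SRAM budget, and the `measure` callback are invented placeholders, not part of the actual tensor computing unit toolchain.

```python
import itertools

def candidate_configs():
    """Enumerate (tile_c, tile_h, tile_w) candidates for one conv shape."""
    return itertools.product([16, 32, 64], [4, 8, 14, 28], [4, 8, 14, 28])

def prune(configs, out_c, out_h, out_w, sram_bytes=512 * 1024):
    """Keep only tilings that fit the shape and the on-chip memory budget."""
    for tc, th, tw in configs:
        if tc > out_c or th > out_h or tw > out_w:
            continue                           # tile larger than the output
        if 4 * tc * th * tw > sram_bytes:
            continue                           # fp32 output tile exceeds SRAM
        yield tc, th, tw

def tune(out_c, out_h, out_w, measure):
    """Benchmark surviving candidates; return the fastest configuration."""
    best_cfg, best_time = None, float('inf')
    for cfg in prune(candidate_configs(), out_c, out_h, out_w):
        elapsed = measure(cfg)                 # run generated operator, time it
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg, best_time
```

In the real system, the measurement step would dispatch the generated operator to the accelerator and time it on silicon; here `measure` is just a caller-supplied hook.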
Abstract: Three recent AI breakthroughs in the arts and sciences serve as motivation: an award-winning digital image, protein folding, and fast matrix multiplication. Many recent developments in artificial neural networks, particularly deep learning (DL), applied and relevant to computational mechanics (solids, fluids, finite-element technology), are reviewed in detail. Both hybrid and pure machine learning (ML) methods are discussed. Hybrid methods combine traditional PDE discretizations with ML methods either (1) to help model complex nonlinear constitutive relations, (2) to nonlinearly reduce the model order for efficient simulation (turbulence), or (3) to accelerate the simulation by predicting certain components in traditional integration methods. Here, methods (1) and (2) rely on the long short-term memory (LSTM) architecture, while method (3) relies on convolutional neural networks. Pure ML methods for solving (nonlinear) PDEs are represented by physics-informed neural network (PINN) methods, which can be combined with an attention mechanism to address discontinuous solutions. Both LSTM and attention architectures, together with modern and generalized classic optimizers that incorporate stochasticity for DL networks, are extensively reviewed. Kernel machines, including Gaussian processes, are covered in sufficient depth for more advanced work such as shallow networks with infinite width. The review is not addressed to experts only: readers are assumed to be familiar with computational mechanics, but not with DL, whose concepts and applications are built up from the basics, aiming to bring first-time learners quickly to the forefront of research. The history and limitations of AI are recounted and discussed, with particular attention to pointing out misstatements or misconceptions in the classics, even in well-known references. Positioning and pointing control of a large-deformable beam is given as an example.
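To give first-time readers a concrete taste of the PINN idea mentioned above, the following minimal PyTorch sketch trains a small network to satisfy the 1D Poisson problem u''(x) = −π² sin(πx) with u(0) = u(1) = 0 (exact solution u(x) = sin(πx)), penalizing the PDE residual at random collocation points plus a boundary term. It is a toy example of our own, not one drawn from the review.

```python
import torch

# Network approximating u(x); tanh keeps second derivatives well behaved.
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    x = torch.rand(64, 1, requires_grad=True)        # collocation points in (0, 1)
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    residual = d2u + torch.pi ** 2 * torch.sin(torch.pi * x)  # u'' + pi^2 sin(pi x)
    xb = torch.tensor([[0.0], [1.0]])                # boundary points
    loss = (residual ** 2).mean() + (net(xb) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```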
Abstract: To achieve higher accuracy, existing optical flow estimation networks typically use a correlation cost volume and a gated recurrent unit (GRU) for iterative refinement, but this incurs a heavy computational load and limits deployment performance on edge devices. To obtain a more lightweight optical flow estimation method, this paper proposes a local constraint and local dilation module (LC-LD module), which replaces self-attention with a combination of convolution and a single pass of axial attention; at low computational cost, it attends with varying degrees of importance to the region surrounding each matched feature point, generating a more accurate correlation cost volume and thereby reducing the number of iterations for a more lightweight model. Second, a shuffled convex upsampling is proposed, which combines grouped convolution and a shuffle operation with convex upsampling, reducing the number of parameters while further improving accuracy. Experimental results show that the method maintains high accuracy while significantly improving runtime efficiency, indicating strong application prospects.
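For reference, the convex upsampling step that the proposed shuffled variant builds on (popularized by the RAFT optical flow network) predicts, for each fine-resolution pixel, softmax weights over a 3×3 coarse neighborhood and takes the corresponding convex combination of coarse flow vectors. Below is a PyTorch sketch of that baseline step only; the paper's variant additionally produces the weight mask with grouped convolution and a channel shuffle, which is not shown here.

```python
import torch
import torch.nn.functional as F

def convex_upsample(flow, mask, r=8):
    """Upsample coarse flow (B, 2, H, W) by factor r via convex combinations.

    mask: (B, 9*r*r, H, W) unnormalized weights over each 3x3 coarse window.
    """
    B, _, H, W = flow.shape
    mask = mask.view(B, 1, 9, r, r, H, W)
    mask = torch.softmax(mask, dim=2)              # convex weights over 3x3 window

    up = F.unfold(r * flow, kernel_size=3, padding=1)  # flow magnitudes scale with r
    up = up.view(B, 2, 9, 1, 1, H, W)

    up = torch.sum(mask * up, dim=2)               # (B, 2, r, r, H, W)
    up = up.permute(0, 1, 4, 2, 5, 3)              # interleave r x r blocks into pixels
    return up.reshape(B, 2, r * H, r * W)
```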