Funding: Supported by the National Natural Science Foundation of China (Nos. 62172411, 62172404, 61972094, and 62202458).
Abstract: SM9 was established in 2016 as an official Chinese identity-based cryptographic (IBC) standard and became an ISO standard in 2021. It is well known that IBC is suitable for Internet of Things (IoT) applications, since centralized processing of client data (e.g., in an IoT cloud) is often done by gateways. However, due to the limited computational resources of IoT devices, the performance of SM9 becomes a bottleneck in practical use. Existing SM9 implementations are often CPU-based, with relatively low latency and low throughput. Consequently, a pivotal challenge for SM9 in large-scale applications is how to reduce latency while maximizing throughput for numerous concurrent inputs. After a systematic analysis of the SM9 algorithms, we apply optimization techniques including precomputation, resource caching, and parallelization to reduce the overhead of SM9. In this work, we introduce the first practical implementation of SM9 and its underlying SM9_P256 curve on GPU. Our GPU implementation combines multiple algorithms and low-level optimizations tailored to the GPU's single instruction, multiple threads (SIMT) architecture in order to achieve high throughput for SM9. Based on these, we propose GAPS, a high-performance Cryptography as a Service (CaaS) for SM9. GAPS adopts a heterogeneous computing architecture that flexibly schedules inputs across two implementation platforms: a CPU for the low-latency processing of sporadic inputs, and a GPU for the high-throughput processing of batch inputs. According to our benchmark, GAPS takes only a few milliseconds to process a single SM9 request in idle mode. Moreover, when operating in its batch processing mode, GAPS can generate 2,038,071 private keys, 248,239 signatures, or 238,001 ciphertexts per second. The results show that GAPS scales seamlessly across inputs of different sizes, preliminarily demonstrating the efficacy of our solution.
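The CPU/GPU split described in the abstract can be illustrated with a minimal routing sketch. This is not the GAPS scheduler: the class name `HybridScheduler`, the `batch_threshold` value, and the queue layout are all hypothetical, chosen only to show sporadic inputs going to a low-latency path and batch inputs going to a high-throughput path.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HybridScheduler:
    """Toy router: small submissions go to the CPU, large ones to the GPU."""

    # Hypothetical cut-off; a real service would tune this from benchmarks.
    batch_threshold: int = 64
    cpu_jobs: List[List[str]] = field(default_factory=list)
    gpu_jobs: List[List[str]] = field(default_factory=list)

    def submit(self, requests: List[str]) -> str:
        """Queue a group of requests and report which platform received it."""
        if len(requests) >= self.batch_threshold:
            # Batch input: keep the group together so one GPU launch can
            # process every request in parallel.
            self.gpu_jobs.append(list(requests))
            return "gpu"
        # Sporadic input: queue each request on its own for low latency.
        for request in requests:
            self.cpu_jobs.append([request])
        return "cpu"


scheduler = HybridScheduler(batch_threshold=4)
print(scheduler.submit(["sign:alice"]))                 # routed to the CPU path
print(scheduler.submit([f"sign:user{i}" for i in range(8)]))  # routed to the GPU path
```

A fixed threshold is the simplest possible policy; the abstract only states that scheduling is flexible, so any adaptive behavior is left out here.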
Funding: The National Natural Science Foundation of China (No. 61702521); the Natural Science Foundation of Tianjin (No. 18JCQNJC00400); the Scientific Research Foundation of Civil Aviation University of China (No. 2017QD12S); the Fundamental Research Funds for the Central Universities of Civil Aviation University of China (Nos. 3122018C023 and 3122018C021).
Abstract: Graphics processing units (GPUs) employ single instruction, multiple data (SIMD) hardware to run threads in parallel while allowing each thread to maintain an arbitrary control flow. Threads running concurrently within a warp may jump to different paths after conditional branches. Such divergent control flow leaves some lanes idle and hence reduces the SIMD utilization of GPUs. To alleviate the waste of SIMD lanes, threads from multiple warps can be collected together and compacted into idle lanes to improve SIMD lane utilization. However, this mechanism induces extra barrier synchronizations, since warps have to be stalled to wait for other warps before a compaction, with the result that in some cases no warps are scheduled. In this paper, we propose an approach to reduce the overhead of the barrier synchronizations induced by compactions. In our approach, a compaction is bypassed by warps whose threads all jump to the same path after a branch. Moreover, warps waiting for a compaction can also bypass it when no warps are ready for issuing. In addition, a compaction is canceled if it cannot reduce the number of idle lanes. The experimental results demonstrate that our approach provides an average improvement of 21% over the baseline GPU for applications with massive divergent branches, while recovering on average 13% of the performance loss induced by compactions for applications with many non-divergent control flows.
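Two of the three rules in the abstract (uniform warps bypass the compaction, and a compaction is canceled when it cannot reduce idle lanes) can be shown with a small counting model. This is a simplified sketch, not the paper's hardware mechanism: it assumes full warps, a single two-way branch, and a 4-lane warp for readability, and the names `plan_compaction` and `WARP_SIZE` are invented here. The runtime rule about bypassing when no warp is ready to issue is not modeled.

```python
from math import ceil
from typing import Dict, List

WARP_SIZE = 4  # illustrative only; real warps are wider


def issue_slots_divergent(warps: List[List[bool]]) -> int:
    """Each warp issues once per distinct branch path its threads take."""
    return sum(len(set(warp)) for warp in warps)


def issue_slots_compacted(warps: List[List[bool]], warp_size: int) -> int:
    """Regroup threads by branch path and pack them into full warps."""
    taken = sum(1 for warp in warps for t in warp if t)
    not_taken = sum(1 for warp in warps for t in warp if not t)
    return ceil(taken / warp_size) + ceil(not_taken / warp_size)


def plan_compaction(warps: List[List[bool]], warp_size: int = WARP_SIZE) -> Dict[str, int]:
    """Decide which warps bypass and whether compaction is worth doing."""
    # Rule 1: warps whose threads all take the same path bypass the barrier.
    uniform = [w for w in warps if len(set(w)) == 1]
    divergent = [w for w in warps if len(set(w)) > 1]

    before = issue_slots_divergent(divergent)
    after = issue_slots_compacted(divergent, warp_size)

    # Rule 3: with a fixed thread count, fewer issue slots means fewer idle
    # lanes, so cancel the compaction when the slot count does not drop.
    compact = after < before
    slots = len(uniform) + (after if compact else before)
    return {"bypassed": len(uniform), "compacted": int(compact), "issue_slots": slots}


T, F = True, False
print(plan_compaction([[T, T, T, T], [T, F, T, F], [F, T, F, T]]))
print(plan_compaction([[T, T, T, F]]))
```

In the first call the two divergent warps together hold four threads per path, so compaction halves their issue slots; in the second call a lone divergent warp gains nothing and the compaction is canceled.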
Funding: Supported in part by the National Natural Science Foundation of China under Grant No. 62473013 and the Key Project of Science and Technology Innovation and Entrepreneurship of TDTEC (No. 2022-TDZD004).
Abstract: The newly emerging neural radiance fields (NeRF) methods can implicitly perform three-dimensional (3D) reconstruction by training a neural network to render novel-view images of a scene from a set of posed images. The Instant Neural Graphics Primitives (Instant-NGP) method further improves the position encoding of NeRF and achieves state-of-the-art efficiency. However, only a local pixel-wise loss is considered when training Instant-NGP, and the nonlocal structural information between pixels is overlooked. Despite good quantitative results, this leads to poor visual quality, especially in terms of completeness. Inspired by the stochastic structural similarity (S3IM) method, which exploits nonlocal structural information of groups of pixels, this paper proposes a new method to improve the completeness of fast novel view synthesis. The proposed method first extends the thread-wise processing of Instant-NGP to processing in a custom thread block (i.e., a group of threads). Then, the relative dimensionless global error in synthesis, i.e., Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS), of the group of pixels corresponding to a group of threads is computed and incorporated into the loss function. Extensive experiments validate the proposed method. It obtains better quantitative results than the original Instant-NGP with fewer iteration steps, and PSNR is increased by 1%. Notable qualitative improvements are obtained, especially for delicate structures and details such as lines and continuous structures. With these improvements in visual quality, our method can boost the practicability of implicit 3D reconstruction in applications such as self-driving and augmented reality.
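For reference, ERGAS in its commonly used form is 100 · (h/l) · sqrt((1/N) · Σ_k RMSE_k² / μ_k²), where N is the number of bands, μ_k is the mean of reference band k, and h/l is a resolution ratio. The sketch below evaluates that standard definition for one group of pixels. How the paper weights or combines this term with the pixel-wise loss is not stated in the abstract, so the function name `ergas`, the default `ratio=1.0`, and the flat-list data layout are assumptions made here.

```python
from math import sqrt
from typing import Sequence


def ergas(pred: Sequence[Sequence[float]],
          ref: Sequence[Sequence[float]],
          ratio: float = 1.0) -> float:
    """ERGAS of a pixel group; each inner sequence holds one color band.

    `ratio` stands in for the resolution ratio h/l of the standard
    definition and is assumed to be 1 here. Every reference band must
    have a non-zero mean, otherwise the division below fails.
    """
    terms = []
    for pred_band, ref_band in zip(pred, ref):
        n = len(ref_band)
        # Mean squared error of this band (RMSE squared).
        mse = sum((p - r) ** 2 for p, r in zip(pred_band, ref_band)) / n
        mean = sum(ref_band) / n
        terms.append(mse / mean ** 2)
    return 100.0 * ratio * sqrt(sum(terms) / len(terms))


# One band, four pixels, each prediction off by 0.1 around a mean of 1.0.
reference = [[1.0, 1.0, 1.0, 1.0]]
rendered = [[1.1, 0.9, 1.1, 0.9]]
print(ergas(rendered, reference))
```

Because the error is normalized by the band mean, the measure depends on the whole group of pixels rather than on each pixel independently, which is the nonlocal property the abstract relies on.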
Funding: Project supported by the National Natural Science Foundation of China (Nos. 52009104 and 52079106), the Shaanxi Provincial Department of Water Resources Project (No. 2017slkj-14), and the Shaanxi Provincial Department of Science and Technology Project (No. 2017JQ3043), China.
Abstract: Solute transport simulations are important in water pollution events. This paper introduces a finite volume Godunov-type model for solving a 4×4 matrix form of the hyperbolic conservation laws consisting of the 2D shallow water equations and transport equations. The model adopts the Harten-Lax-van Leer-contact (HLLC) approximate Riemann solver to calculate the cell interface fluxes. It handles changes in the wet-dry interfaces over actual complex terrain well and has a strong shock-capturing ability. Second-order accuracy is achieved by using monotonic upstream-centred scheme for conservation laws (MUSCL) linear reconstruction with limited slopes together with the Runge-Kutta time integration method. At the same time, the introduction of graphics processing unit (GPU)-accelerated computing greatly increases the computing speed. The model is validated against multiple benchmarks, and the results are in good agreement with analytical solutions and other published numerical predictions. In the third test case, the GPU and central processing unit (CPU) models take 3.865 s and 13.865 s, respectively, indicating that the GPU model increases the calculation speed by a factor of 3.6. In the fourth test case, the GPU model is 9.8–44.6 times more efficient than the traditional CPU model across grids of different resolutions. Therefore, it has better potential than previous models for large-scale simulation of solute transport in water pollution incidents. It can provide a reliable theoretical basis and strong data support for the rapid assessment and early warning of water pollution accidents.
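The speedup quoted for the third test case follows directly from the two run times in the abstract. The short check below reproduces the arithmetic; the helper name `speedup` is introduced here for illustration only.

```python
def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Ratio of baseline run time to accelerated run time."""
    return baseline_seconds / accelerated_seconds


cpu_time = 13.865  # seconds, CPU model, third test case
gpu_time = 3.865   # seconds, GPU model, third test case

# 13.865 / 3.865 is about 3.59, which rounds to the reported 3.6.
print(f"{speedup(cpu_time, gpu_time):.2f}x")
```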