The core shooting process is the most widely used technique for making sand cores, and it plays an important role in sand core quality. Although numerical simulation holds promise for optimizing the core shooting process, research on the numerical simulation of this process is very limited. Based on a two-fluid model (TFM) and a kinetic-friction constitutive correlation, a program for 3D numerical simulation of the core shooting process has been developed and has achieved good agreement with in-situ experiments. To meet the needs of engineering applications, a graphics processing unit (GPU) has also been used to improve the calculation efficiency. The parallel algorithm based on the Compute Unified Device Architecture (CUDA) platform can significantly decrease computing time through multi-threaded GPU execution. In this work, the program accelerated by the CUDA parallelization method was developed, and the accuracy of the calculations was ensured by comparison with in-situ experimental results photographed by a high-speed camera. The design and optimization of the parallel algorithm are discussed. The simulation result for a sand core test piece indicated the improvement in calculation efficiency provided by the GPU. The developed program has also been validated by in-situ experiments with a transparent core box, a high-speed camera, and a pressure measuring system. The computing time of the parallel program was reduced by nearly 95% while the simulation result remained quite consistent with the experimental data. The GPU parallelization method can successfully solve the problem of the low computational efficiency of the 3D sand shooting simulation program, and thus the developed GPU program is appropriate for engineering applications.
Based on the finite element method (FEM) in the frequency domain and the particle-in-cell approach in the time domain, a hybrid-domain multipactor threshold prediction algorithm is proposed in this paper. The proposed algorithm combines the advantages of frequency-domain and time-domain algorithms, offering both high computational accuracy and considerable computational efficiency. In addition, the compute unified device architecture (CUDA) acceleration technique can be employed to further enhance its simulation efficiency. Numerical examples are carried out to demonstrate the effectiveness of the proposed algorithm. The results indicate that the multipactor threshold can be accurately predicted and the computational efficiency can be improved.
We propose a low-cost and high-damage-threshold phase control system that employs a piezoelectric ceramic transducer modulator controlled by a stochastic parallel gradient descent algorithm. Efficient phase locking of two fiber amplifiers is demonstrated. Experimental results show that the energy encircled in the target pinhole is increased by a factor of 1.76 and the visibility of the fringe pattern is as high as 90% when the system is in closed loop. The phase control system has potential for phase locking of large-number and high-power fiber laser arrays.
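The stochastic parallel gradient descent (SPGD) loop named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' controller: it models two equal-amplitude beams with a fixed phase offset `phi0`, a single control phase `u` (standing in for the piezoelectric modulator), and a noise-free metric. The function names, gain, and perturbation size are hypothetical.

```python
import math
import random


def combined_metric(u, phi0):
    """Normalized combined power of two equal beams with phase difference u + phi0."""
    return 0.5 * (1.0 + math.cos(u + phi0))


def spgd_lock(phi0, gain=50.0, delta=0.1, iters=200, seed=0):
    """Drive the control phase u toward the maximum of the metric with SPGD.

    gain, delta and iters are illustrative values, not taken from the paper.
    """
    rng = random.Random(seed)
    u = 0.0  # control phase applied by the (hypothetical) modulator
    for _ in range(iters):
        # Random-sign perturbation of the control channel.
        d = delta if rng.random() < 0.5 else -delta
        # Two-sided measurement of the metric change caused by the perturbation.
        dj = combined_metric(u + d, phi0) - combined_metric(u - d, phi0)
        # Gradient-ascent update: move in the direction that increased the metric.
        u += gain * dj * d
    return u, combined_metric(u, phi0)
```

With more than one control channel, every channel would be perturbed at the same time with its own random sign, which is where the word "parallel" in SPGD comes from; the single-channel case is kept here for brevity.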
In a recent paper [Yan F L et al. Chin. Phys. Lett. 25 (2008) 1187], a quantum secret sharing protocol between multiparty and multiparty with single photons and unitary transformations was presented. We analyze the security of the protocol and find that a dishonest participant can eavesdrop on the key by using a special attack. Finally, we describe this strategy and put forward an improved version of the protocol that can withstand this kind of attack.
Large eddy simulation (LES) using the Smagorinsky eddy viscosity model is added to the two-dimensional nine-velocity (D2Q9) lattice Boltzmann equation (LBE) with multi-relaxation-time (MRT) to simulate incompressible turbulent cavity flows with Reynolds numbers up to 1 × 10^7. To improve the computational efficiency of the lattice Boltzmann method (LBM) in numerical simulations of turbulent flows, the massively parallel computing power of a graphics processing unit (GPU) with the compute unified device architecture (CUDA) is introduced into the MRT-LBE-LES model. The model performs well compared with the results of others, with a 76-fold increase in computational efficiency. It appears that the higher the Reynolds number is, the smaller the Smagorinsky constant should be if the lattice number is fixed. Also, for a selected high Reynolds number and a selected proper Smagorinsky constant, there is a minimum requirement for the lattice number so that the Smagorinsky eddy viscosity will not be excessively large.
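The interplay between Reynolds number, lattice number, and Smagorinsky constant can be made concrete with the standard lattice-unit relations: molecular viscosity nu0 = U N / Re, relaxation time tau = 3 nu + 1/2, and eddy viscosity nu_t = (C_s Delta)^2 |S|. The sketch below only evaluates these textbook relations; the lid speed, lattice number, and strain-rate magnitude are hypothetical inputs, and how the paper evaluates |S| is not assumed.

```python
def lattice_viscosity(reynolds, u_lid, n_lattice):
    """Molecular viscosity in lattice units for a cavity of n_lattice nodes per side."""
    return u_lid * n_lattice / reynolds


def relaxation_time(nu):
    """Relaxation time tied to the viscosity nu (lattice units, c_s^2 = 1/3)."""
    return 3.0 * nu + 0.5


def smagorinsky_viscosity(c_s, strain_rate_mag, delta=1.0):
    """Smagorinsky eddy viscosity; strain_rate_mag is supplied by the caller."""
    return (c_s * delta) ** 2 * strain_rate_mag


# Illustrative numbers only: Re = 1e7 on a 256-node cavity with lid speed 0.1.
nu0 = lattice_viscosity(1e7, 0.1, 256)
tau0 = relaxation_time(nu0)                      # very close to the 0.5 limit
nu_t = smagorinsky_viscosity(0.1, 0.01)          # hypothetical C_s and |S|
tau_eff = relaxation_time(nu0 + nu_t)            # eddy viscosity lifts tau away from 0.5
```

At a fixed lattice number the molecular contribution shrinks as the Reynolds number grows, so the eddy viscosity term dominates the effective relaxation time, which is consistent with the abstract's remark that the Smagorinsky constant and lattice number must be chosen together.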
Compared with conventional X-ray absorption imaging, X-ray phase-contrast imaging shows higher contrast on samples with low attenuation coefficients, such as blood vessels and soft tissues. Among the modalities of phase-contrast imaging, grating-based phase-contrast imaging has been widely accepted owing to its wide range of sample selection and the fact that it does not require a coherent source. However, the downside is the substantially larger amount of data generated by the phase-stepping method, which slows down the reconstruction process. The graphics processing unit (GPU) has the advantage of allowing parallel computing, which is very useful for processing large quantities of data. In this paper, a compute unified device architecture (CUDA) C program based on the GPU is introduced to accelerate the phase retrieval and filtered back projection (FBP) algorithms for grating-based tomography. Depending on the size of the data, the CUDA C program shows different amounts of speedup over the standard C program on the same Visual Studio 2010 platform. The speedup ratio increases as the size of the data increases.
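The per-pixel phase retrieval that makes phase stepping data-heavy can be sketched as follows. This is a minimal illustration of the common Fourier-coefficient approach, assuming an ideal sinusoidal stepping curve with at least three equally spaced steps; the function names and the sign convention are assumptions, and the abstract does not state which retrieval formula the authors use.

```python
import cmath
import math


def stepping_curve(a0, a1, phi, n_steps):
    """Ideal intensity samples a0 + a1*cos(2*pi*k/N + phi) for one detector pixel."""
    return [a0 + a1 * math.cos(2 * math.pi * k / n_steps + phi)
            for k in range(n_steps)]


def retrieve(curve):
    """Return (mean intensity, visibility, phase) from one stepping curve."""
    n = len(curve)
    c0 = sum(curve) / n
    # First Fourier coefficient of the stepping curve; equals (a1/2)*exp(i*phi).
    c1 = sum(val * cmath.exp(-2j * math.pi * k / n)
             for k, val in enumerate(curve)) / n
    return c0, 2 * abs(c1) / c0, cmath.phase(c1)


def differential_phase(sample_curve, reference_curve):
    """Phase shift of the sample scan relative to the reference scan, wrapped to (-pi, pi]."""
    _, _, phase_s = retrieve(sample_curve)
    _, _, phase_r = retrieve(reference_curve)
    d = phase_s - phase_r
    return math.atan2(math.sin(d), math.cos(d))
```

Every pixel is processed independently of all others, which is the property that makes this step a natural fit for one GPU thread per pixel.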
A graphics processing unit (GPU)-accelerated biological species recognition method using a partially connected neural evolutionary network model is introduced in this paper. The partially connected neural evolutionary network adopted in the paper can overcome the disadvantage of traditional neural networks, which accept only small inputs. The whole image is taken as the input of the neural network, so the maximum number of features can be kept for recognition. To speed up the recognition process of the neural network, a fast implementation of the partially connected neural network was developed on an NVIDIA Tesla C1060 using the NVIDIA compute unified device architecture (CUDA) framework. Image sets of eight biological species were obtained to test the GPU implementation and its serial CPU counterpart. The experimental results showed that the GPU implementation works effectively in terms of both recognition rate and speed, and gained a speedup of 343 over its CPU counterpart. Compared with a feature-based recognition method on the same recognition task, the method also achieved an acceptable correct rate of 84.6% when tested on the eight biological species.
The existing theory of decoy-state quantum cryptography assumes that the dark count rate is constant, but in practice it fluctuates. We develop a new decoy-state scheme, achieve a more practical key generation rate in the presence of fluctuation of the dark count rate, and compare the result with that of the decoy-state scheme without fluctuation. It is found that the key generation rate and the maximal secure distance decrease under the influence of the fluctuation of the dark count rate.
A thin TiO2 layer inserted in a phase change memory (PCM) cell to form a deep sub-micro bottom electrode (DBE) is proposed, and its electro-thermal characteristics are investigated with three-dimensional finite element analysis. Compared with the conventional PCM cell with a SiN stop layer, the reset threshold current of the PCM cell with the TiO2 layer is reduced from 1.8 mA to 1.2 mA, and the ratio of the amorphous resistance to the crystalline resistance increases from 65 to 100. The optimum thickness of the TiO2 layer and the optimum height of the DBE are 10 nm and 200 nm, respectively. Therefore, the PCM cell with the TiO2 layer can decrease the programming power consumption and increase the heating efficiency. The TiO2 film is a better candidate than the SiN film in the PCM cell structure for preparing the DBE and reducing the programming power in the reset operation.
We recently proposed a flexible quantum secure direct communication protocol [Chin. Phys. Lett. 23 (2006) 3152]. By analyzing its security in the perfect channel from the standpoint of quantum information theory, we find that an eavesdropper is capable of stealing all the information without being detected. Two typical attacks are presented to illustrate this point. A solution to this loophole is also suggested, and we show its effectiveness against the most general individual attack in the ideal case. We also discuss the security in the imperfect case when there is noise and loss.
In recent years, graphics processing unit (GPU)-accelerated intelligent algorithms have been widely used for solving combinatorial optimization problems, which are NP-hard. These intelligent algorithms involve a common operation, namely reduction, in which the most suitable candidate solution in the neighborhood is selected. Because reduction is one of the main procedures, it is necessary to optimize it on the GPU. In this paper, we propose an enhanced warp-based reduction on the GPU. Compared with existing block-based reduction methods, our method efficiently exploits the potential of implementation at the warp level, which better matches the characteristics of current GPU architectures. Firstly, vectorized access is used to improve global memory access performance. Secondly, at the level of thread block reduction, an enhanced warp-based reduction on shared memory is presented to form partial results. Thirdly, the number of thread blocks is obtained by maximizing the size of the thread block and the maximum number of threads per streaming multiprocessor on the GPU. Finally, the proposed method is evaluated on three generations of NVIDIA GPUs and achieves better performance than previous methods.
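The idea of reducing at warp granularity can be illustrated with a small emulation. The sketch below mimics the well-known shuffle-down tree pattern inside a 32-lane warp and then reduces the per-warp partial results again; it is a plain Python model of that generic pattern, not the enhanced method proposed in the paper, and all names are hypothetical.

```python
WARP = 32  # lanes per warp on NVIDIA GPUs


def warp_reduce_min(lanes):
    """Tree-reduce 32 (cost, index) pairs; lane 0 ends up holding the minimum."""
    lanes = list(lanes)
    offset = WARP // 2
    while offset > 0:
        # Each lane reads the value held by lane + offset (shuffle-down);
        # lanes that would read out of range keep their own value.
        shifted = [lanes[l + offset] if l + offset < WARP else lanes[l]
                   for l in range(WARP)]
        lanes = [min(a, b) for a, b in zip(lanes, shifted)]
        offset //= 2
    return lanes[0]


def block_reduce_min(costs):
    """Select the best (lowest-cost) candidate and its index from a non-empty list."""
    identity = (float("inf"), -1)  # padding value that never wins
    pairs = [(c, i) for i, c in enumerate(costs)]
    while len(pairs) > 1:
        pairs += [identity] * ((-len(pairs)) % WARP)  # pad to whole warps
        # One partial result per warp; these are reduced again in the next pass.
        pairs = [warp_reduce_min(pairs[w:w + WARP])
                 for w in range(0, len(pairs), WARP)]
    return pairs[0]
```

Carrying the index alongside the cost is what lets the reduction return which neighborhood candidate was selected, not just its value.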
With the rapid development of mobile technology and smart devices, crowdsensing has shown great potential for collecting massive data. Considering the limitation of computing power, edge computing is introduced to reduce unnecessary data transmission. In edge-computing-enabled crowdsensing, massive data must be preliminarily processed by edge computing devices (ECDs). Compared with the traditional central platform, these ECDs are limited by their own capabilities, so they may obtain only part of the relevant factors and cannot process data comprehensively. ECDs involved in one task are therefore required to cooperate to process the task data. The privacy of participants is important in crowdsensing, so blockchain is used because of its decentralization and tamper resistance. In crowdsensing tasks, it is usually difficult to obtain the assessment criteria in advance, so reinforcement learning is introduced. As mentioned above, ECDs cannot process task data comprehensively and are required to cooperate in quality assessment. Therefore, a blockchain-based framework for data quality in edge-computing-enabled crowdsensing (BFEC) is proposed in this paper. DPoR (Delegated Proof of Reputation), which was proposed in our previous work, is improved to suit BFEC. The final result is calculated iteratively without revealing the privacy of participants. Experiments on the open datasets Adult, Blog, and Wine Quality show that our new framework outperforms existing methods in executing sensing tasks.
Scenario simulation analysis of water environmental emergencies is very important for risk prevention and control and for emergency response. To quickly and accurately simulate the transport and diffusion of high-intensity pollutants during sudden water pollution events, this study proposes a high-precision pollution transport and diffusion model for unstructured grids based on the Compute Unified Device Architecture (CUDA). A finite volume method with a total variation diminishing limiter using the r-factor proposed by Kong is used to reduce numerical diffusion and oscillation errors in the simulation of pollutants under sharp concentration conditions, and graphics processing unit acceleration is used to improve computational efficiency. The advection-diffusion process of the model is verified numerically using two benchmark cases, and the efficiency of the model is evaluated using an engineering example. The results demonstrate that the model performs well in simulating material transport in the presence of sharp concentration gradients. Additionally, it has high computational efficiency: the accelerated model runs 46 times faster than the single-threaded original model. The efficiency of the accelerated model meets the requirements of engineering applications, and rapid early warning and assessment of water pollution accidents is achieved.
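How a total variation diminishing (TVD) limiter suppresses oscillations near a sharp concentration front can be sketched in one dimension. The example below is a stand-in under stated assumptions: it uses the classic structured-grid r-factor (ratio of consecutive differences) with the van Leer limiter on a periodic uniform grid, not the unstructured-grid r-factor of Kong used in the paper, and the function names are hypothetical.

```python
def van_leer(r):
    """van Leer flux limiter; returns 0 for r <= 0 and tends to 2 for large r."""
    return (r + abs(r)) / (1.0 + abs(r))


def advect_tvd(u, courant, steps):
    """Advect u to the right on a periodic grid with a flux-limited scheme.

    courant is the Courant number a*dt/dx and must lie in [0, 1].
    """
    n = len(u)
    for _ in range(steps):
        flux = []  # flux through face i+1/2, divided by the velocity a
        for i in range(n):
            du_up = u[i] - u[i - 1]          # upwind difference
            du_dn = u[(i + 1) % n] - u[i]    # downwind difference
            corr = 0.0
            if du_dn != 0.0:
                r = du_up / du_dn            # classic r-factor
                # Limited anti-diffusive correction to the first-order upwind flux.
                corr = 0.5 * van_leer(r) * (1.0 - courant) * du_dn
            flux.append(u[i] + corr)
        u = [u[i] - courant * (flux[i] - flux[i - 1]) for i in range(n)]
    return u
```

For a square concentration pulse, the limited scheme keeps the solution between the initial minimum and maximum while conserving total mass, which is the behavior the abstract attributes to the limiter under sharp concentration conditions.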
To enable efficient and low-cost automated apple harvesting, this study presents a multi-class instance segmentation model, SCAL (Star-CAA-LADH), which uses a single RGB sensor for image acquisition. The model achieves accurate segmentation of fruits, fruit-bearing branches, and main branches using only a single RGB image, providing comprehensive visual inputs for robotic harvesting. A Star-CAA module was proposed by integrating the Star operation with a Context-Anchored Attention (CAA) mechanism, enhancing directional sensitivity and multi-scale feature perception. The Backbone and Neck networks were equipped with hierarchically structured SCA-T/F modules to improve the fusion of high- and low-level features, resulting in more continuous masks and sharper boundaries. In the Head network, a Segment_LADH module was employed to optimize classification, bounding box regression, and mask generation, thereby improving segmentation accuracy for small and adherent targets. To enhance robustness in adverse weather conditions, a Chain-of-Thought Prompted Adaptive Enhancer (CPA) module was integrated, increasing model resilience in degraded environments. Experimental results demonstrate that SCAL achieves 94.9% AP_M and 95.1% mAP_M, outperforming YOLOv11s by 6.6% and 4.6%, respectively. Under multi-weather testing conditions, the CPA-SCAL variant consistently outperforms the other comparison models in accuracy. After INT8 quantization, the model size was reduced to 14.5 MB, with an inference speed of 47.2 frames per second (fps) on the NVIDIA Jetson AGX Xavier. Experiments conducted in simulated orchard environments validate the effectiveness and generalization capabilities of the SCAL model, demonstrating its suitability as an efficient and comprehensive visual solution for intelligent harvesting in complex agricultural settings.
Funding: Supported by the National Natural Science Foundation of China (No. 11172134) and the Funding of Jiangsu Innovation Program for Graduate Education (No. CXLX13_132).
Abstract: A personal desktop platform with teraflops of peak performance and thousands of cores can be realized at the price of a conventional workstation by using programmable graphics processing units (GPUs). A GPU-based parallel Euler/Navier-Stokes solver is developed for 2-D compressible flows by using NVIDIA's Compute Unified Device Architecture (CUDA) programming model in the CUDA Fortran programming language. The techniques for implementing CUDA kernels, the double-layered thread hierarchy, and the varied memory hierarchy are presented to form the GPU-based algorithm for the Euler/Navier-Stokes equations. The resulting parallel solver is validated by a set of typical test flow cases. The numerical results show that a speedup of dozens of times relative to a serial CPU implementation can be achieved using a single GPU desktop platform, which demonstrates that a GPU desktop can serve as a cost-effective parallel computing platform to substantially accelerate computational fluid dynamics (CFD) simulations.
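The double-layered thread hierarchy (a grid of thread blocks, each a block of threads) is typically used to assign one thread to one mesh cell. The sketch below emulates that index arithmetic in plain Python so it can be checked without a GPU. It assumes a structured ni x nj mesh, a 16 x 16 block, and 0-based indices as in CUDA C (CUDA Fortran indices are 1-based); the function names and sizes are hypothetical and are not taken from the paper.

```python
def launch_config(ni, nj, block=(16, 16)):
    """Grid dimensions (in blocks) needed to cover an ni x nj mesh."""
    bx, by = block
    return ((ni + bx - 1) // bx, (nj + by - 1) // by)  # ceiling division


def global_index(block_idx, thread_idx, block=(16, 16)):
    """Cell index computed by one thread from its block and thread coordinates."""
    bx, by = block
    return (block_idx[0] * bx + thread_idx[0],
            block_idx[1] * by + thread_idx[1])


def covered_cells(ni, nj, block=(16, 16)):
    """Enumerate the cells touched by an emulated kernel launch."""
    grid = launch_config(ni, nj, block)
    cells = []
    for gbx in range(grid[0]):
        for gby in range(grid[1]):
            for tx in range(block[0]):
                for ty in range(block[1]):
                    i, j = global_index((gbx, gby), (tx, ty), block)
                    # Bounds guard: threads in partially filled blocks do nothing.
                    if i < ni and j < nj:
                        cells.append((i, j))
    return cells
```

The bounds guard matters whenever the mesh size is not a multiple of the block size: the launch rounds up, so some threads fall outside the mesh and must skip the update.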
Funding: Supported by a Grant-in-Aid for Scientific Research on Innovation Areas "Molecular Robotics" (No. 24104004) of the Ministry of Education, Culture, Sports, Science, and Technology, Japan.
Abstract: A microtubule gliding assay is a biological experiment for observing the dynamics of microtubules driven by motor proteins fixed on a glass surface. When appropriate microtubule interactions are set up in gliding assay experiments, microtubules often organize and create higher-level dynamics such as ring and bundle structures. In order to reproduce such higher-level dynamics on computers, we have been focusing on building a real-time 3D microtubule simulation. This simulation enables us to gain more knowledge about microtubule dynamics and their swarm movements by adjusting simulation parameters in real time. One of the technical challenges in creating a real-time 3D simulation is balancing the 3D rendering and the computing performance. Graphics processing unit (GPU) programming plays an essential role in balancing the millions of tasks and makes this real-time 3D simulation possible. By using general-purpose computing on graphics processing units (GPGPU), we are able to run the simulation in a massively parallel fashion, even when dealing with more complex interactions between microtubules such as overriding and snuggling. Because performance is an important factor, a performance model has also been constructed from the analysis of the microtubule simulation, and it is consistent with the performance measurements on different GPGPU architectures with regard to the number of cores and clock cycles.
Abstract: The numerical treatment of engineering application problems often ultimately requires solving systems of linear or nonlinear equations. The solution process on digital computing devices usually takes a tremendous amount of time because of the extremely large problem sizes encountered in most real-world engineering applications. Practical solvers for systems of linear and nonlinear equations based on multiple graphics processing units (GPUs) are therefore proposed in order to accelerate the solution process. In the linear and nonlinear solvers, the preconditioned bi-conjugate gradient stabilized (PBi-CGstab) method and the inexact Newton method are used to achieve fast and stable convergence. Multiple GPUs are used to provide the larger data storage that large problems need.
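The structure of a preconditioned Bi-CGSTAB iteration can be sketched on a small dense system. This is a textbook right-preconditioned variant with a Jacobi (diagonal) preconditioner, written in pure Python for readability; the choice of preconditioner, the helper names, and the stopping tolerance are assumptions, and nothing here reflects the authors' multi-GPU data layout. The matrix-vector products and dot products are the operations a GPU implementation would parallelize.

```python
def matvec(a, x):
    """Dense matrix-vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def bicgstab(a, b, tol=1e-10, max_iter=100):
    """Solve a x = b with Jacobi-preconditioned Bi-CGSTAB, starting from x = 0."""
    n = len(b)
    inv_diag = [1.0 / a[i][i] for i in range(n)]  # Jacobi preconditioner M^-1
    x = [0.0] * n
    r = list(b)          # residual b - a x for the zero initial guess
    r_hat = list(r)      # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        y = [m * pi for m, pi in zip(inv_diag, p)]      # apply preconditioner
        v = matvec(a, y)
        alpha = rho_new / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        if dot(s, s) ** 0.5 < tol:                      # converged at the half step
            x = [xi + alpha * yi for xi, yi in zip(x, y)]
            break
        z = [m * si for m, si in zip(inv_diag, s)]      # apply preconditioner
        t = matvec(a, z)
        omega = dot(t, s) / dot(t, t)                   # stabilizing step length
        x = [xi + alpha * yi + omega * zi for xi, yi, zi in zip(x, y, z)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        rho = rho_new
    return x
```

In an inexact Newton method, a solver like this would be called on each linearized system with a loose tolerance that is tightened as the nonlinear iteration converges.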
Funding: Supported by the National Key Research and Development Program (2021YFA0716400), the National Natural Science Foundation of China (62225405, 62350002, 61991443, 62127814, 62235005, and 61927811), and the Collaborative Innovation Center of Solid-State Lighting and Energy-Saving Electronics.
Abstract: The rapid development of the Internet of Things (IoT) urgently requires miniaturized edge computing devices with high efficiency and low power consumption. In-sensor computing has emerged as a promising technology to enable in-situ data processing within the sensor array. Here, we report an optoelectronic array for in-sensor computing that integrates photodiodes (PDs) with resistive random-access memories (RRAMs). The PD-RRAM unit cell exhibits reconfigurable optoelectronic output and photo-responsivity when the RRAMs are programmed into different resistance states. Furthermore, a 3×3 PD-RRAM array is fabricated to demonstrate optical image recognition, achieving a universal architecture with ultralow latency and low power consumption. This study highlights the great potential of the PD-RRAM optoelectronic array as an energy-efficient in-sensor computing primitive for future IoT applications.
Funding: Supported by the National Key Research and Development Program of China [grant number 2021YFB3900904] and the National Natural Science Foundation of China [grant numbers 42071368, U2033216, 41871287].
Abstract: As an established spatial analytical tool, Geographically Weighted Regression (GWR) has been applied across a variety of disciplines. However, its use can be challenging for large datasets, which are increasingly prevalent in today's digital world. In this study, we propose two high-performance R solutions for GWR via Multi-core Parallel (MP) and Compute Unified Device Architecture (CUDA) techniques, named GWR-MP and GWR-CUDA, respectively. We compared GWR-MP and GWR-CUDA with three existing solutions available in Geographically Weighted Models (GWmodel), Multi-scale GWR (MGWR), and Fast GWR (FastGWR). Results showed that all five solutions perform differently across varying sample sizes, with no single solution a clear winner in terms of computational efficiency. Specifically, the solutions given in GWmodel and MGWR provided acceptable computational costs for GWR studies with a relatively small sample size. For a large sample size, GWR-MP and FastGWR provided coherent solutions on a Personal Computer (PC) with a common multi-core configuration, and GWR-MP provided more efficient computing capacity for each core or thread than FastGWR. For cases when the sample size was very large, and for these cases only, GWR-CUDA provided the most efficient solution, although its I/O cost with small samples should be noted. In summary, GWR-MP and GWR-CUDA provide high-performance R solutions that complement existing ones and should be preferred for certain data-rich GWR studies.
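The reason GWR parallelizes well is visible in a minimal calibration loop: each regression point gets its own weighted least-squares fit, independent of all the others. The sketch below assumes a single predictor plus intercept, a Gaussian kernel, and a fixed bandwidth, and solves each local fit in closed form. It is written in Python rather than R to match the other examples here, and the function name and signature are hypothetical, not the GWR-MP or GWR-CUDA interface.

```python
import math


def gwr_fit(coords, x, y, bandwidth):
    """Return one (intercept, slope) pair per regression point.

    coords: list of (u, v) locations; x, y: predictor and response values.
    """
    betas = []
    for (ui, vi) in coords:              # each iteration is an independent task
        sw = swx = swy = swxx = swxy = 0.0
        for (uj, vj), xj, yj in zip(coords, x, y):
            d2 = (ui - uj) ** 2 + (vi - vj) ** 2
            w = math.exp(-0.5 * d2 / bandwidth ** 2)   # Gaussian kernel weight
            sw += w
            swx += w * xj
            swy += w * yj
            swxx += w * xj * xj
            swxy += w * xj * yj
        # Closed-form solution of the 2 x 2 weighted normal equations.
        det = sw * swxx - swx * swx
        slope = (sw * swxy - swx * swy) / det
        intercept = (swy - slope * swx) / sw
        betas.append((intercept, slope))
    return betas
```

Because the outer loop carries no dependency between regression points, it can be split across CPU cores or mapped to GPU threads; the cost of moving data to the device is the kind of overhead the abstract flags for small samples.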
Funding: Partially supported by the National Natural Science Foundation of China (No. 11972189), the Natural Science Foundation of Jiangsu Province (No. BK20190391), the Natural Science Foundation of Anhui Province (No. 1908085QF260), and the Priority Academic Program Development of Jiangsu Higher Education Institutions.
Abstract: A graphics processing unit (GPU)-accelerated discontinuous Galerkin (DG) method is presented for solving two-dimensional laminar flows. The DG method is ported from the central processing unit (CPU) to the GPU, achieving GPU speedup through programming under the compute unified device architecture (CUDA) model. The CUDA kernel subroutines are designed to meet the requirements of the high-order computations of the DG method. The corresponding data structures are constructed in a component-wise manner, and the thread hierarchy is organized in a cell-wise or edge-wise manner according to the integrals involved in solving the laminar Navier-Stokes equations, in which the inviscid and viscous flux terms are computed by the local Lax-Friedrichs scheme and the second scheme of Bassi & Rebay, respectively. A strong stability preserving Runge-Kutta scheme is then used for time marching of the numerical solutions. The resulting GPU-accelerated DG method is first validated on the classical Couette flow problem with different mesh sizes and different orders of approximation, which shows that the expected orders of convergence are achieved. Numerical simulations of typical flows over a circular cylinder or a NACA 0012 airfoil are then carried out, and the results are compared with analytical solutions or with available experimental and numerical values reported in the literature, together with a performance analysis of the developed code in terms of GPU speedups. The results show that the computing time of the presented test cases is significantly reduced without losing accuracy, and speedups of up to 69.7 times are achieved by the present method in comparison with its CPU counterpart.
Funding: Supported by the National Natural Science Foundation of China (51575304).
Abstract: The core shooting process is the most widely used technique for making sand cores, and it plays an important role in sand core quality. Although numerical simulation can potentially optimize the core shooting process, research on its numerical simulation is very limited. Based on a two-fluid model (TFM) and a kinetic-friction constitutive correlation, a program for 3D numerical simulation of the core shooting process has been developed and has achieved good agreement with in-situ experiments. To meet the needs of engineering applications, a graphics processing unit (GPU) has also been used to improve the calculation efficiency. The parallel algorithm based on the Compute Unified Device Architecture (CUDA) platform can significantly decrease computing time through multi-threaded GPU execution. In this work, a program accelerated by the CUDA parallelization method was developed, and the accuracy of the calculations was ensured by comparison with in-situ experimental results photographed by a high-speed camera. The design and optimization of the parallel algorithm are discussed. The simulation of a sand core test piece showed the improvement in calculation efficiency provided by the GPU. The developed program has also been validated by in-situ experiments with a transparent core box, a high-speed camera, and a pressure measuring system. The computing time of the parallel program was reduced by nearly 95%, while the simulation result remained consistent with the experimental data. The GPU parallelization method can successfully solve the problem of low computational efficiency in the 3D sand shooting simulation program, and thus the developed GPU program is appropriate for engineering applications.
Funding: This work was supported by the National Natural Science Foundation of China (61571022, 61971022) and the National Key Laboratory Foundation (HTKJ2019KI504013, 61424020305).
Abstract: Based on the finite element method (FEM) in the frequency domain and the particle-in-cell approach in the time domain, a hybrid-domain multipactor threshold prediction algorithm is proposed in this paper. The proposed algorithm combines the advantages of frequency-domain and time-domain algorithms, offering high computational accuracy and considerable computational efficiency. In addition, the compute unified device architecture (CUDA) acceleration technique can be employed to further enhance its simulation efficiency. Numerical examples are carried out to demonstrate the effectiveness of the proposed algorithm. The results indicate that the multipactor threshold can be accurately predicted and that the computational efficiency can be improved.
Abstract: We propose a low-cost, high-damage-threshold phase control system that employs a piezoelectric ceramic transducer modulator controlled by a stochastic parallel gradient descent algorithm. Efficient phase locking of two fiber amplifiers is demonstrated. Experimental results show that the energy encircled in the target pinhole is increased by a factor of 1.76 and the visibility of the fringe pattern is as high as 90% when the system is in closed loop. The phase control system has potential for phase locking of large-number, high-power fiber laser systems.
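The stochastic parallel gradient descent (SPGD) control loop named in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the metric `combined_intensity` is an idealized two-beam interference term standing in for the measured energy in the pinhole, and the function names, perturbation amplitude, and gain are hypothetical rather than taken from the paper.

```python
import math
import random


def combined_intensity(phase_error):
    """Idealized metric: normalized intensity of two equal beams (1.0 = in phase)."""
    return 0.5 * (1.0 + math.cos(phase_error))


def spgd_phase_lock(initial_error, perturbation=0.1, gain=50.0, steps=200, seed=0):
    """Drive one phase actuator toward the metric maximum with SPGD.

    initial_error is the unknown phase difference (radians) between the beams;
    the returned control value is the phase applied by the hypothetical modulator.
    """
    rng = random.Random(seed)
    control = 0.0
    for _ in range(steps):
        # Random bipolar perturbation of the control signal.
        delta = perturbation * rng.choice((-1.0, 1.0))
        # Two-sided measurement of the metric under the perturbation.
        j_plus = combined_intensity(initial_error + control + delta)
        j_minus = combined_intensity(initial_error + control - delta)
        # Move the control in the direction that increased the metric.
        control += gain * (j_plus - j_minus) * delta
    return control, combined_intensity(initial_error + control)


if __name__ == "__main__":
    control, metric = spgd_phase_lock(2.0)
    print(f"applied phase = {control:.4f} rad, metric = {metric:.6f}")
```

With several channels, the same update would be applied to every actuator at once using independent random perturbations, which is where the "parallel" in the name comes from.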
Funding: Supported by the National Natural Science Foundation of China under Grant Nos 60873191, 60903152 and 60821001, the SRFDP under Grant No 200800131016, Beijing Nova Program under Grant No 2008B51, Key Project of the Ministry of Education of China under Grant No 109014, China Postdoctoral Science Foundation under Grant No 20090450018, Fujian Provincial Natural Science Foundation under Grant No 2008J0013, and the Foundation of Fujian Education Bureau under Grant No 3A08044.
Abstract: In a recent paper [Yan F L et al. Chin. Phys. Lett. 25 (2008) 1187], a quantum secret sharing protocol between multiparty and multiparty with single photons and unitary transformations was presented. We analyze the security of the protocol and find that a dishonest participant can eavesdrop on the key by using a special attack. Finally, we describe this strategy and put forward an improved version of the protocol that can withstand this kind of attack.
Funding: Supported by the College of William and Mary, Virginia Institute of Marine Science, for the study environment.
Abstract: Large eddy simulation (LES) using the Smagorinsky eddy viscosity model is added to the two-dimensional nine-velocity (D2Q9) lattice Boltzmann equation (LBE) with multi-relaxation-time (MRT) to simulate incompressible turbulent cavity flows with Reynolds numbers up to 1 × 10^7. To improve the computational efficiency of the lattice Boltzmann method (LBM) in numerical simulations of turbulent flows, the massively parallel computing power of a graphics processing unit (GPU) with the compute unified device architecture (CUDA) is introduced into the MRT-LBE-LES model. The model performs well compared with the results of others, with a 76-fold increase in computational efficiency. It appears that the higher the Reynolds number is, the smaller the Smagorinsky constant should be if the lattice number is fixed. Also, for a selected high Reynolds number and a properly selected Smagorinsky constant, there is a minimum requirement for the lattice number so that the Smagorinsky eddy viscosity will not be excessively large.
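The interplay described above between Reynolds number, Smagorinsky constant, and lattice number can be made concrete with the textbook relations commonly used in lattice Boltzmann LES. This is only a sketch under stated assumptions: lattice units with unit grid spacing and time step, the standard eddy viscosity form ν_t = (C_s Δ)² |S|, and the usual viscosity-relaxation link τ = 3ν + 1/2. The function names are hypothetical and the paper's MRT formulation may differ in detail.

```python
def lattice_viscosity(reynolds, lid_speed, lattice_number):
    """Molecular viscosity in lattice units for a cavity with N nodes per side."""
    return lid_speed * lattice_number / reynolds


def eddy_viscosity(smagorinsky_constant, strain_rate_magnitude, filter_width=1.0):
    """Smagorinsky eddy viscosity: (C_s * filter_width)^2 * |S|."""
    return (smagorinsky_constant * filter_width) ** 2 * strain_rate_magnitude


def effective_relaxation_time(nu_molecular, nu_eddy):
    """Relaxation time for the total viscosity, tau = 3 * nu_total + 1/2."""
    return 3.0 * (nu_molecular + nu_eddy) + 0.5


if __name__ == "__main__":
    # Illustrative values only: Re = 1e7, lid speed 0.1, 1000 lattice nodes per side.
    nu0 = lattice_viscosity(1.0e7, 0.1, 1000)
    for cs in (0.05, 0.1, 0.2):
        nu_t = eddy_viscosity(cs, strain_rate_magnitude=0.5)
        tau = effective_relaxation_time(nu0, nu_t)
        print(f"C_s = {cs}: nu_t / nu0 = {nu_t / nu0:.1f}, tau = {tau:.5f}")
```

At high Reynolds number the molecular viscosity in lattice units becomes very small, so the eddy viscosity term dominates the relaxation time unless the Smagorinsky constant is reduced or the lattice is refined.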
Funding: Supported by the National Basic Research Program (973) of China (No. 2010CB834300) and the Biomedical Engineering Cross-Research Fund of Shanghai Jiao Tong University (Nos. YG2011MS49 and YG2013MS65).
Abstract: Compared with conventional X-ray absorption imaging, X-ray phase-contrast imaging shows higher contrast on samples with low attenuation coefficients, such as blood vessels and soft tissues. Among the modalities of phase-contrast imaging, grating-based phase-contrast imaging has been widely accepted because it supports a wide range of samples and does not require a coherent source. However, the downside is the substantially larger amount of data generated by the phase-stepping method, which slows down the reconstruction process. A graphics processing unit (GPU) allows parallel computing, which is very useful for processing large quantities of data. In this paper, a compute unified device architecture (CUDA) C program running on the GPU is introduced to accelerate the phase retrieval and filtered back projection (FBP) algorithms for grating-based tomography. Depending on the size of the data, the CUDA C program shows different amounts of speedup over the standard C program on the same Visual Studio 2010 platform, and the speedup ratio increases as the size of the data increases.
Funding: National Natural Science Foundation of China (No. 60975084) and Natural Science Foundation of Fujian Province, China (No. 2011J05159).
Abstract: A graphics processing unit (GPU)-accelerated biological species recognition method using a partially connected neural evolutionary network model is introduced in this paper. The partially connected neural evolutionary network adopted in the paper can overcome the small-input limitation of traditional neural networks. The whole image is used as the input of the neural network, so the maximum number of features is kept for recognition. To speed up the recognition process, a fast implementation of the partially connected neural network was built on an NVIDIA Tesla C1060 using the NVIDIA compute unified device architecture (CUDA) framework. Image sets of eight biological species were obtained to test the GPU implementation and its serial CPU counterpart. The experimental results showed that the GPU implementation works effectively in terms of both recognition rate and speed, achieving a speedup of 343 over the CPU implementation. Compared with a feature-based recognition method on the same task, the method also achieved an acceptable correct rate of 84.6% when tested on the eight biological species.
Funding: Supported by the National Natural Science Foundation of China under Grant No 10504042.
Abstract: The existing theory of decoy-state quantum cryptography assumes that the dark count rate is constant, but in practice it fluctuates. We develop a new decoy-state scheme, obtain a more practical key generation rate in the presence of dark count rate fluctuation, and compare the result with that of the decoy-state method without fluctuation. It is found that the key generation rate and the maximal secure distance decrease under the influence of the dark count rate fluctuation.
Funding: Supported by the National Basic Research Program of China (2007CB935400 and 2006CB302700), the National High Technology Research and Development Program of China (2008AA031402), the Science and Technology Council of Shanghai (0752nm013, 07QA14065, 07SA08, 08DZ2200700, 08JC1421700), the National Natural Science Foundation of China (60776058), and the Chinese Academy of Sciences (083YQA1001).
Abstract: A thin TiO2 layer inserted in a phase change memory (PCM) cell to form a deep sub-micro bottom electrode (DBE) is proposed, and its electro-thermal characteristics are investigated with three-dimensional finite element analysis. Compared with the conventional PCM cell with a SiN stop layer, the reset threshold current of the PCM cell with the TiO2 layer is reduced from 1.8 mA to 1.2 mA, and the ratio of the amorphous resistance to the crystalline resistance increases from 65 to 100. The optimum thickness of the TiO2 layer and the optimum height of the DBE are 10 nm and 200 nm, respectively. Therefore, the PCM cell with the TiO2 layer can decrease the programming power consumption and increase the heating efficiency. The TiO2 film is a better candidate than the SiN film in the PCM cell structure for preparing the DBE and reducing the programming power in the reset operation.
Abstract: We recently proposed a flexible quantum secure direct communication protocol [Chin. Phys. Lett. 23 (2006) 3152]. By analyzing its security in the perfect channel from the perspective of quantum information theory, we find that an eavesdropper is capable of stealing all the information without being detected. Two typical attacks are presented to illustrate this point. A solution to this loophole is also suggested, and we show its effectiveness against the most general individual attack in the ideal case. We also discuss the security in the imperfect case, when there is noise and loss.
Funding: Supported by the National Natural Science Foundation of China (61472289) and the Natural Science Foundation of Hubei Province (2015CFB254).
Abstract: In recent years, graphics processing unit (GPU)-accelerated intelligent algorithms have been widely used for solving combinatorial optimization problems, which are NP-hard. These intelligent algorithms involve a common operation, namely reduction, in which the most suitable candidate solution in the neighborhood is selected. Because reduction is one of the main procedures, it is necessary to optimize it on the GPU. In this paper, we propose an enhanced warp-based reduction on the GPU. Compared with existing block-based reduction methods, our method efficiently exploits the potential of warp-level implementation, which better matches the characteristics of current GPU architectures. Firstly, vectorized access is used to improve global memory access performance. Secondly, at the level of thread block reduction, an enhanced warp-based reduction on shared memory is presented to form partial results. Thirdly, the number of thread blocks is obtained by maximizing the thread block size and the maximum number of threads per streaming multiprocessor on the GPU. Finally, the proposed method is evaluated on three generations of NVIDIA GPUs and shows better performance than previous methods.
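The warp-level reduction pattern that this abstract builds on can be emulated on the CPU to show its data flow. The sketch below is not the authors' enhanced method: it is a plain Python model of a generic shuffle-down tree reduction over 32 lanes, followed by a combination of per-warp partial results, and the names `warp_reduce_best` and `block_reduce_best` are hypothetical.

```python
WARP_SIZE = 32


def warp_reduce_best(costs):
    """Emulate a shuffle-down tree reduction over one warp.

    Returns (cost, lane) of the smallest cost, as held by lane 0 at the end.
    """
    if len(costs) != WARP_SIZE:
        raise ValueError("expected exactly one value per lane")
    lanes = [(cost, lane) for lane, cost in enumerate(costs)]
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane reads the pair held by lane + offset, as a shuffle-down would,
        # and keeps the better one; lanes without a partner keep their own pair.
        previous = list(lanes)
        for lane in range(WARP_SIZE):
            partner = lane + offset
            if partner < WARP_SIZE:
                lanes[lane] = min(previous[lane], previous[partner])
        offset //= 2
    return lanes[0]


def block_reduce_best(costs):
    """Reduce each 32-value chunk warp by warp, then combine the partial results."""
    partials = []
    for start in range(0, len(costs), WARP_SIZE):
        chunk = list(costs[start:start + WARP_SIZE])
        chunk += [float("inf")] * (WARP_SIZE - len(chunk))  # pad idle lanes
        cost, lane = warp_reduce_best(chunk)
        partials.append((cost, start + lane))
    return min(partials)


if __name__ == "__main__":
    candidate_costs = [(i * 37) % 101 for i in range(70)]
    print(block_reduce_best(candidate_costs))
```

Five halving steps are enough for 32 lanes, and no lane waits on a block-wide barrier during them, which is the property that makes warp-level reduction attractive on the GPU.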
Funding: Supported by the Key Science and Technology Project of Henan Province (201300210400), the National Key Research and Development Project (2018YFB1800304), the National Natural Science Foundation of China (61762058), the Fundamental Research Funds for the Central Universities (xzy012020112), and the Natural Science Foundation of Gansu Province (21JR7RA282).
Abstract: With the rapid development of mobile technology and smart devices, crowdsensing has shown great potential for collecting massive data. Considering the limitation of computing power, edge computing is introduced to reduce unnecessary data transmission. In edge-computing-enabled crowdsensing, massive data must be preliminarily processed by edge computing devices (ECDs). Compared with the traditional central platform, these ECDs are limited by their own capability, so they may obtain only part of the relevant factors and cannot process data comprehensively. ECDs involved in one task are therefore required to cooperate in processing the task data. The privacy of participants is important in crowdsensing, so blockchain is used for its decentralization and tamper resistance. In crowdsensing tasks, it is usually difficult to obtain the assessment criteria in advance, so reinforcement learning is introduced. As mentioned above, ECDs cannot process task data comprehensively and are required to cooperate on quality assessment. Therefore, a blockchain-based framework for data quality in edge-computing-enabled crowdsensing (BFEC) is proposed in this paper. DPoR (Delegated Proof of Reputation), which was proposed in our previous work, is improved to suit BFEC. The final result is calculated iteratively without revealing the privacy of participants. Experiments on the open datasets Adult, Blog, and Wine Quality show that our new framework outperforms existing methods in executing sensing tasks.
Funding: Supported by the National Key Research and Development Program of China (Grant No. 2022YFC3202004) and the National Natural Science Foundation of China (Grant No. 51979105).
Abstract: Scenario simulation analysis of water environmental emergencies is very important for risk prevention and control and for emergency response. To quickly and accurately simulate the transport and diffusion of high-intensity pollutants during sudden water pollution events, this study proposes a high-precision pollution transport and diffusion model for unstructured grids based on the Compute Unified Device Architecture (CUDA). A finite volume method with a total variation diminishing limiter using the r-factor proposed by Kong is used to reduce numerical diffusion and oscillation errors when simulating pollutants with sharp concentration gradients, and graphics processing unit acceleration is used to improve computational efficiency. The advection-diffusion process of the model is verified numerically using two benchmark cases, and the efficiency of the model is evaluated using an engineering example. The results demonstrate that the model performs well in simulating material transport in the presence of sharp concentration gradients. Additionally, it has high computational efficiency: the speedup is 46 times relative to the single-threaded original model. The efficiency of the accelerated model meets the requirements of engineering applications, and rapid early warning and assessment of water pollution accidents is achieved.
Funding: Supported by the Qinchuangyuan Project of Shaanxi Province (Grant No. 2023KXJ-016).
Abstract: To enable efficient and low-cost automated apple harvesting, this study presents a multi-class instance segmentation model, SCAL (Star-CAA-LADH), which uses a single RGB sensor for image acquisition. The model achieves accurate segmentation of fruits, fruit-bearing branches, and main branches using only a single RGB image, providing comprehensive visual inputs for robotic harvesting. A Star-CAA module was proposed by integrating the Star operation with a Context-Anchored Attention (CAA) mechanism, enhancing directional sensitivity and multi-scale feature perception. The Backbone and Neck networks were equipped with hierarchically structured SCA-T/F modules to improve the fusion of high- and low-level features, resulting in more continuous masks and sharper boundaries. In the Head network, a Segment_LADH module was employed to optimize classification, bounding box regression, and mask generation, thereby improving segmentation accuracy for small and adherent targets. To enhance robustness in adverse weather conditions, a Chain-of-Thought Prompted Adaptive Enhancer (CPA) module was integrated, increasing model resilience in degraded environments. Experimental results demonstrate that SCAL achieves 94.9% AP_M and 95.1% mAP_M, outperforming YOLOv11s by 6.6% and 4.6%, respectively. Under multi-weather testing conditions, the CPA-SCAL variant consistently outperforms the other comparison models in accuracy. After INT8 quantization, the model size was reduced to 14.5 MB, with an inference speed of 47.2 frames per second (fps) on the NVIDIA Jetson AGX Xavier. Experiments conducted in simulated orchard environments validate the effectiveness and generalization capabilities of the SCAL model, demonstrating its suitability as an efficient and comprehensive visual solution for intelligent harvesting in complex agricultural settings.