The modern paradigm of the Internet of Things (IoT) has led to a significant increase in demand for latency-sensitive applications in fog-based cloud computing. However, such applications cannot meet strict quality of service (QoS) requirements. The large-scale deployment of IoT requires more effective use of network infrastructure to ensure QoS when processing big data. Generally, cloud-centric IoT application deployment involves different modules running on terminal devices and cloud servers. Fog devices with different computing capabilities must process the data generated by the end devices, so deploying latency-sensitive applications in a heterogeneous fog computing environment is a difficult task. In addition, when connection delays between the fog and the terminal devices are inconsistent, the deployment of such applications becomes more complicated. In this article, we propose an algorithm that can effectively place application modules on network nodes while considering connection delay, processing power, and sensing data volume. We conducted simulations in iFogSim to confirm the effectiveness of the algorithm against traditional cloud-based deployment. The simulation results verify the effectiveness of the proposed algorithm in terms of end-to-end delay and network consumption; moreover, latency and execution time are insensitive to the number of sensors.
A personal desktop platform with thousands of cores and teraflops peak performance can be realized at the price of a conventional workstation by using programmable graphics processing units (GPUs). A GPU-based parallel Euler/Navier-Stokes solver is developed for 2-D compressible flows by using NVIDIA's Compute Unified Device Architecture (CUDA) programming model in the CUDA Fortran programming language. The techniques of implementing CUDA kernels, a double-layered thread hierarchy, and a varied memory hierarchy are presented to form the GPU-based algorithm for the Euler/Navier-Stokes equations. The resulting parallel solver is validated by a set of typical test flow cases. The numerical results show that a speedup of dozens of times relative to a serial CPU implementation can be achieved using a single-GPU desktop platform, which demonstrates that a GPU desktop can serve as a cost-effective parallel computing platform to substantially accelerate computational fluid dynamics (CFD) simulations.
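Although the solver above was written in CUDA Fortran, the double-layered thread hierarchy it describes carries over directly to CUDA C++. The sketch below is a minimal illustration under assumed names, grid size, and update rule, not the paper's code: a 2-D grid of blocks (outer layer) and a 2-D block of threads (inner layer) assign one thread to each cell of the flow field.

```cuda
#include <cuda_runtime.h>

// Hypothetical cell update: one thread per grid cell. The 2-D block/grid
// layout mirrors the i-j index structure of the flow field.
__global__ void updateCells(const float* rhs, float* q, float dt,
                            int ni, int nj)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < ni && j < nj) {
        int idx = j * ni + i;
        q[idx] += dt * rhs[idx];   // explicit time step on the residual
    }
}

int main()
{
    const int ni = 512, nj = 512;
    float *q, *rhs;
    cudaMalloc(&q,   ni * nj * sizeof(float));
    cudaMalloc(&rhs, ni * nj * sizeof(float));
    cudaMemset(q,   0, ni * nj * sizeof(float));
    cudaMemset(rhs, 0, ni * nj * sizeof(float));

    dim3 block(16, 16);                         // inner layer: threads
    dim3 grid((ni + block.x - 1) / block.x,     // outer layer: blocks
              (nj + block.y - 1) / block.y);
    updateCells<<<grid, block>>>(rhs, q, 1e-4f, ni, nj);
    cudaDeviceSynchronize();

    cudaFree(q); cudaFree(rhs);
    return 0;
}
```

Mapping blocks and threads to the i-j structure of the mesh keeps neighboring threads on neighboring cells, which is what lets global memory accesses coalesce.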
Numerical treatment of engineering application problems often eventually results in the solution of systems of linear or nonlinear equations. The solution process on digital computational devices usually takes tremendous time due to the extremely large problem sizes encountered in most real-world engineering applications. Therefore, practical solvers for systems of linear and nonlinear equations based on multiple graphics processing units (GPUs) are proposed in order to accelerate the solving process. In the linear and nonlinear solvers, the preconditioned bi-conjugate gradient stable (PBi-CGstab) method and the inexact Newton method are used to achieve fast and stable convergence behavior. Multiple GPUs are utilized to obtain the additional data storage that large problems require.
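A recurring kernel in Krylov solvers such as PBi-CGstab is the dot product, and it illustrates how work and storage split across several GPUs. The sketch below is a minimal illustration under assumed names and sizes, not the solvers' implementation: each device reduces its own vector slice, and the host folds the per-device partials (the atomicAdd on double requires compute capability 6.0 or newer).

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <vector>

// Partial dot product on one device's slice of the vectors. Each block
// reduces through shared memory, then one atomicAdd folds the block
// result into the device total. (atomicAdd on double needs sm_60+.)
__global__ void dotKernel(const double* x, const double* y, double* out, int n)
{
    __shared__ double cache[256];
    double sum = 0.0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += x[i] * y[i];
    cache[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(out, cache[0]);
}

int main()
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) return 0;
    const int n = 1 << 20;
    int chunk = (n + ndev - 1) / ndev;

    std::vector<double*> x(ndev), y(ndev), part(ndev);
    for (int d = 0; d < ndev; ++d) {          // launch on every GPU
        cudaSetDevice(d);
        int len = std::min(chunk, n - d * chunk);
        cudaMalloc(&x[d], len * sizeof(double));
        cudaMalloc(&y[d], len * sizeof(double));
        cudaMalloc(&part[d], sizeof(double));
        cudaMemset(part[d], 0, sizeof(double));
        // ... copy this device's slice of the two vectors into x[d], y[d] ...
        dotKernel<<<128, 256>>>(x[d], y[d], part[d], len);
    }
    double total = 0.0;
    for (int d = 0; d < ndev; ++d) {          // host-side reduction
        cudaSetDevice(d);
        double p = 0.0;
        cudaMemcpy(&p, part[d], sizeof(double), cudaMemcpyDeviceToHost);
        total += p;
        cudaFree(x[d]); cudaFree(y[d]); cudaFree(part[d]);
    }
    printf("dot = %f\n", total);
    return 0;
}
```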
As an established spatial analytical tool, Geographically Weighted Regression (GWR) has been applied across a variety of disciplines. However, its usage can be challenging for large datasets, which are increasingly prevalent in today's digital world. In this study, we propose two high-performance R solutions for GWR via Multi-core Parallel (MP) and Compute Unified Device Architecture (CUDA) techniques, respectively GWR-MP and GWR-CUDA. We compared GWR-MP and GWR-CUDA with three existing solutions available in Geographically Weighted Models (GWmodel), Multi-scale GWR (MGWR) and Fast GWR (FastGWR). Results showed that all five solutions perform differently across varying sample sizes, with no single solution a clear winner in terms of computational efficiency. Specifically, the solutions given in GWmodel and MGWR provided acceptable computational costs for GWR studies with relatively small sample sizes. For a large sample size, GWR-MP and FastGWR provided coherent solutions on a Personal Computer (PC) with a common multi-core configuration, and GWR-MP provided more efficient computing capacity for each core or thread than FastGWR. For cases when the sample size was very large, and for these cases only, GWR-CUDA provided the most efficient solution, though its I/O cost makes it inefficient for small samples. In summary, GWR-MP and GWR-CUDA provide complementary high-performance R solutions to existing ones and should be preferred for certain data-rich GWR studies.
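The stage a CUDA back end parallelizes most naturally in GWR is the per-location weighting, the first step of each local regression. The kernel below is an illustrative sketch of that stage with a Gaussian kernel and a fixed bandwidth; the array names and the one-thread-per-location mapping are assumptions, not GWR-CUDA's internals.

```cuda
#include <cuda_runtime.h>
#include <cmath>

// One thread per regression location i: fill row i of the geographic
// weight matrix with a Gaussian kernel, w_ij = exp(-0.5 * (d_ij / b)^2).
__global__ void gwrWeights(const float* xs, const float* ys, float* W,
                           float bandwidth, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int j = 0; j < n; ++j) {
        float dx = xs[i] - xs[j], dy = ys[i] - ys[j];
        float u = sqrtf(dx * dx + dy * dy) / bandwidth;
        W[i * (size_t)n + j] = expf(-0.5f * u * u);
    }
}

int main()
{
    const int n = 4096;
    float *xs, *ys, *W;
    cudaMalloc(&xs, n * sizeof(float));
    cudaMalloc(&ys, n * sizeof(float));
    cudaMalloc(&W,  (size_t)n * n * sizeof(float));
    cudaMemset(xs, 0, n * sizeof(float));   // real coordinates go here
    cudaMemset(ys, 0, n * sizeof(float));
    gwrWeights<<<(n + 255) / 256, 256>>>(xs, ys, W, 1.0f, n);
    cudaDeviceSynchronize();
    cudaFree(xs); cudaFree(ys); cudaFree(W);
    return 0;
}
```

Each row of W then weights one local least-squares fit, which is where GWR's O(n²) cost, and hence its parallelism, comes from.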
A graphics processing unit (GPU)-accelerated discontinuous Galerkin (DG) method is presented for solving two-dimensional laminar flows. The DG method is ported from the central processing unit to the GPU, achieving GPU speedup through programming under the compute unified device architecture (CUDA) model. The CUDA kernel subroutines are designed to meet the requirements of the high-order computing of the DG method. The corresponding data structures are constructed in component-wise manners, and the thread hierarchy is manipulated in cell-wise or edge-wise manners associated with the related integrals involved in solving the laminar Navier-Stokes equations, in which the inviscid and viscous flux terms are computed by the local Lax-Friedrichs scheme and the second scheme of Bassi & Rebay, respectively. A strong stability preserving Runge-Kutta scheme is then used for time marching of the numerical solutions. The resulting GPU-accelerated DG method is first validated by traditional Couette flow problems with different mesh sizes and different orders of approximation, which shows that the expected orders of convergence can be achieved. Numerical simulations of typical flows over a circular cylinder or a NACA 0012 airfoil are then carried out, and the results are compared with analytical solutions or available experimental and numerical values reported in the literature, together with a performance analysis of the developed code in terms of GPU speedups. This shows that the computing time of the presented test cases is significantly reduced without losing accuracy, while impressive speedups up to 69.7 times are achieved by the present method in comparison to its CPU counterpart.
The core shooting process is the most widely used technique to make sand cores, and it plays an important role in the quality of sand cores. Although numerical simulation can hopefully optimize the core shooting process, research on its numerical simulation is very limited. Based on a two-fluid model (TFM) and a kinetic-friction constitutive correlation, a program for 3D numerical simulation of the core shooting process has been developed and has achieved good agreement with in-situ experiments. To meet the needs of engineering applications, a graphics processing unit (GPU) has also been used to improve the calculation efficiency. The parallel algorithm based on the Compute Unified Device Architecture (CUDA) platform can significantly decrease computing time through the multi-threaded GPU. In this work, the program accelerated by the CUDA parallelization method was developed, and the accuracy of the calculations was ensured by comparison with in-situ experimental results photographed by a high-speed camera. The design and optimization of the parallel algorithm are discussed. The simulation result of a sand core test piece indicated the improvement in calculation efficiency afforded by the GPU. The developed program has also been validated by in-situ experiments with a transparent core box, a high-speed camera, and a pressure measuring system. The computing time of the parallel program was reduced by nearly 95% while the simulation result remained quite consistent with the experimental data. The GPU parallelization method successfully solves the problem of the low computational efficiency of the 3D sand shooting simulation program, and thus the developed GPU program is appropriate for engineering applications.
The study of induced polarization (IP) information extraction from magnetotelluric (MT) sounding data is of great practical significance to the exploitation of deep mineral, oil, and gas resources. The linear inversion method, which has been given priority in previous research on IP information extraction, has three main problems: 1) dependency on the initial model, 2) easily falling into local minima, and 3) serious non-uniqueness of solutions. Taking the nonlinearity and nonconvexity of IP information extraction into consideration, a two-stage CO-PSO minimum structure inversion method using the compute unified device architecture (CUDA) is proposed. On one hand, a novel Cauchy oscillation particle swarm optimization (CO-PSO) algorithm is applied to extract nonlinear IP information from MT sounding data, implemented as a parallel algorithm within the CUDA computing architecture; on the other hand, the impact of the polarizability on the observation data is strengthened by introducing a second-stage inversion process, and a regularization parameter is applied in the fitness function of the PSO algorithm to address the multi-solution problem of the inversion. Inversion simulation results for polarization layers in different strata of various geoelectric models show that smooth models of resistivity and IP parameters can be obtained by the proposed algorithm, and the results are relatively stable and accurate. Experiments with added noise indicate that the method is robust to Gaussian white noise. Compared with the traditional PSO and GA algorithms, the proposed algorithm is more efficient and yields better inversion results.
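In swarm methods like the one above, the fitness evaluations of all particles are independent and usually dominate the runtime, so the natural CUDA mapping is one thread per particle. The sketch below substitutes a sphere function for the paper's regularized data misfit; the dimension, names, and fitness itself are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define DIM 8   // hypothetical number of model parameters per particle

// One thread per particle: evaluate its fitness and update its personal
// best. A sphere function stands in for the regularized data misfit of
// the two-stage inversion.
__global__ void evalFitness(const float* pos, float* pbestVal,
                            float* pbestPos, int nParticles)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nParticles) return;
    float f = 0.0f;
    for (int d = 0; d < DIM; ++d) {
        float x = pos[p * DIM + d];
        f += x * x;                       // placeholder misfit term
    }
    if (f < pbestVal[p]) {                // keep the personal best
        pbestVal[p] = f;
        for (int d = 0; d < DIM; ++d)
            pbestPos[p * DIM + d] = pos[p * DIM + d];
    }
}

int main()
{
    const int nParticles = 1024;
    float *pos, *pbestVal, *pbestPos;
    cudaMalloc(&pos,      nParticles * DIM * sizeof(float));
    cudaMalloc(&pbestVal, nParticles * sizeof(float));
    cudaMalloc(&pbestPos, nParticles * DIM * sizeof(float));
    cudaMemset(pos, 0, nParticles * DIM * sizeof(float));
    // initialize personal bests to a large value so the first pass wins
    std::vector<float> inf(nParticles, 1e30f);
    cudaMemcpy(pbestVal, inf.data(), nParticles * sizeof(float),
               cudaMemcpyHostToDevice);
    evalFitness<<<(nParticles + 255) / 256, 256>>>(pos, pbestVal, pbestPos,
                                                   nParticles);
    cudaDeviceSynchronize();
    cudaFree(pos); cudaFree(pbestVal); cudaFree(pbestPos);
    return 0;
}
```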
Based on the finite element method (FEM) in the frequency domain and the particle-in-cell approach in the time domain, a hybrid-domain multipactor threshold prediction algorithm is proposed in this paper. The proposed algorithm combines the advantages of frequency-domain and time-domain algorithms, offering both high computational accuracy and considerable computational efficiency. In addition, the compute unified device architecture (CUDA) acceleration technique can also be employed to further enhance its simulation efficiency. Numerical examples are carried out to demonstrate the effectiveness of the proposed algorithm. The results indicate that the multipactor threshold can be accurately predicted and the computational efficiency can be improved.
Compared with conventional X-ray absorption imaging, X-ray phase-contrast imaging shows higher contrast on samples with low attenuation coefficients, such as blood vessels and soft tissues. Among the modalities of phase-contrast imaging, grating-based phase contrast imaging has been widely accepted owing to its wide range of sample selections and its exemption from a coherent source. However, the downside is the substantially larger amount of data generated by the phase-stepping method, which slows down the reconstruction process. The graphics processing unit (GPU) has the advantage of allowing parallel computing, which is very useful for processing large quantities of data. In this paper, a compute unified device architecture (CUDA) C program based on the GPU is introduced to accelerate the phase retrieval and filtered back projection (FBP) algorithms for grating-based tomography. Depending on the size of the data, the CUDA C program shows different amounts of speed-up over the standard C program on the same Visual Studio 2010 platform. Moreover, the speed-up ratio increases as the size of the data increases.
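The back-projection stage of FBP parallelizes well because every reconstructed pixel accumulates its sum over projection angles independently. The kernel below is a minimal parallel-beam sketch with nearest-neighbor detector sampling, a simplification of, not an excerpt from, the program described above.

```cuda
#include <cuda_runtime.h>

// One thread per reconstructed pixel: accumulate the filtered sinogram
// over all projection angles (parallel beam, nearest-neighbor detector
// sampling for brevity).
__global__ void backProject(const float* sino,   // [nAngles][nDet], filtered
                            float* image,        // [N][N] output slice
                            const float* cosT, const float* sinT,
                            int nAngles, int nDet, int N)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= N || py >= N) return;

    float x = px - N / 2.0f, y = py - N / 2.0f;
    float sum = 0.0f;
    for (int a = 0; a < nAngles; ++a) {
        float t = x * cosT[a] + y * sinT[a];     // detector coordinate
        int det = (int)(t + nDet / 2.0f);
        if (det >= 0 && det < nDet)
            sum += sino[a * nDet + det];
    }
    image[py * N + px] = sum * 3.14159265f / nAngles;
}

int main()
{
    const int N = 256, nAngles = 180, nDet = 367;
    float *sino, *image, *cosT, *sinT;
    cudaMalloc(&sino,  nAngles * nDet * sizeof(float));
    cudaMalloc(&image, N * N * sizeof(float));
    cudaMalloc(&cosT,  nAngles * sizeof(float));
    cudaMalloc(&sinT,  nAngles * sizeof(float));
    cudaMemset(sino, 0, nAngles * nDet * sizeof(float));
    cudaMemset(cosT, 0, nAngles * sizeof(float));
    cudaMemset(sinT, 0, nAngles * sizeof(float));
    // ... fill sino with ramp-filtered projections, cosT/sinT per angle ...
    dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
    backProject<<<grid, block>>>(sino, image, cosT, sinT, nAngles, nDet, N);
    cudaDeviceSynchronize();
    cudaFree(sino); cudaFree(image); cudaFree(cosT); cudaFree(sinT);
    return 0;
}
```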
A graphics processing unit (GPU)-accelerated biological species recognition method using a partially connected neural evolutionary network model is introduced in this paper. The partially connected neural evolutionary network adopted in the paper can overcome the disadvantage of traditional neural networks, which are limited to small inputs. The whole image is considered as the input of the neural network, so maximal feature information can be kept for recognition. To speed up the recognition process of the neural network, a fast implementation of the partially connected neural network was conducted on an NVIDIA Tesla C1060 using the NVIDIA compute unified device architecture (CUDA) framework. Image sets of eight biological species were obtained to test the GPU implementation and its counterpart serial CPU implementation, and experimental results showed that the GPU implementation works effectively in terms of both recognition rate and speed, gaining a 343-fold speedup over its counterpart CPU implementation. Compared to a feature-based recognition method on the same recognition task, the method also achieved an acceptable correct rate of 84.6% when tested on the eight biological species.
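A partially connected network stores, for each neuron, only the indices and weights of the inputs it actually sees, which maps cleanly onto one CUDA thread per neuron. The kernel below sketches such a forward pass using CSR-style index arrays; the layout and the tanh activation are assumptions, not the paper's network.

```cuda
#include <cuda_runtime.h>
#include <cmath>

// Partially connected forward pass: each output neuron sums only its own
// subset of input pixels, given by CSR-style index arrays. One thread per
// neuron.
__global__ void partialForward(const float* input, const int* rowPtr,
                               const int* colIdx, const float* weight,
                               float* output, int nNeurons)
{
    int o = blockIdx.x * blockDim.x + threadIdx.x;
    if (o >= nNeurons) return;
    float s = 0.0f;
    for (int k = rowPtr[o]; k < rowPtr[o + 1]; ++k)
        s += weight[k] * input[colIdx[k]];
    output[o] = tanhf(s);    // placeholder activation
}

int main()
{
    const int nInputs = 64 * 64, nNeurons = 256;
    float *input, *weight, *output; int *rowPtr, *colIdx;
    cudaMalloc(&input,  nInputs * sizeof(float));
    cudaMalloc(&output, nNeurons * sizeof(float));
    cudaMalloc(&rowPtr, (nNeurons + 1) * sizeof(int));
    cudaMalloc(&colIdx, sizeof(int));
    cudaMalloc(&weight, sizeof(float));
    cudaMemset(input,  0, nInputs * sizeof(float));
    cudaMemset(rowPtr, 0, (nNeurons + 1) * sizeof(int)); // empty rows here;
    // real connectivity and evolved weights would be filled in instead
    partialForward<<<(nNeurons + 255) / 256, 256>>>(input, rowPtr, colIdx,
                                                    weight, output, nNeurons);
    cudaDeviceSynchronize();
    cudaFree(input); cudaFree(rowPtr); cudaFree(colIdx);
    cudaFree(weight); cudaFree(output);
    return 0;
}
```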
Large eddy simulation (LES) using the Smagorinsky eddy viscosity model is added to the two-dimensional, nine-velocity-component (D2Q9) lattice Boltzmann equation (LBE) with multiple relaxation times (MRT) to simulate incompressible turbulent cavity flows with Reynolds numbers up to 1 × 10^7. To improve the computational efficiency of the LBM in numerical simulations of turbulent flows, the massively parallel computing power of a graphics processing unit (GPU) with the compute unified device architecture (CUDA) is introduced into the MRT-LBE-LES model. The model performs well compared with results from others, with a 76-fold increase in computational efficiency. It appears that the higher the Reynolds number, the smaller the Smagorinsky constant should be if the lattice number is fixed. Also, for a selected high Reynolds number and a selected proper Smagorinsky constant, there is a minimum requirement on the lattice number so that the Smagorinsky eddy viscosity does not become excessively large.
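The Smagorinsky closure enters the LBM as a purely local, per-node adjustment of the relaxation time, so it adds almost no parallel overhead. The kernel below sketches that adjustment in lattice units for the simpler BGK form, where nu = (tau - 0.5)/3; the MRT model in the paper derives its relaxation rates from the same eddy viscosity, and the array names here are assumptions.

```cuda
#include <cuda_runtime.h>

// Per-node Smagorinsky closure in lattice units (dx = dt = 1):
//   nu_t = Cs^2 * |S|,   tau_eff = 3 * (nu0 + nu_t) + 0.5
// shown for the BGK form; MRT sets its relaxation rates analogously.
__global__ void smagorinskyTau(const float* strainMag, float* tau,
                               float nu0, float cs, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float nu_t = cs * cs * strainMag[i];   // local eddy viscosity
    tau[i] = 3.0f * (nu0 + nu_t) + 0.5f;
}

int main()
{
    const int n = 1024 * 1024;             // lattice nodes
    float *strainMag, *tau;
    cudaMalloc(&strainMag, n * sizeof(float));
    cudaMalloc(&tau, n * sizeof(float));
    cudaMemset(strainMag, 0, n * sizeof(float));
    smagorinskyTau<<<(n + 255) / 256, 256>>>(strainMag, tau, 1.0f / 6.0f,
                                             0.1f, n);
    cudaDeviceSynchronize();
    cudaFree(strainMag); cudaFree(tau);
    return 0;
}
```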
The evolution of expert and knowledge-based systems in architecture requires the gradual population of building-specific databases. Often these databases are slow to evolve due to the time-consuming nature of effectively categorizing building features in a meaningful way that allows for retrieval and reuse. New advances in artificial intelligence such as Hierarchical Temporal Memory (HTM) have the potential to make the construction of these databases more realistic in the near future. Based on an emerging theory of human neurological function, HTMs excel at ambiguous pattern recognition. This paper includes a first experiment using HTMs for learning and recognizing patterns in the form of two distinct American house plan typologies, and further tests the relationship of the HTMs' recognition tendencies in alternate house plan types. Results from the experiment indicate that HTMs develop a quality of storage similar to that of humans and are therefore a promising option for capturing multi-modal information in future design automation efforts.
In recent years, graphics processing unit (GPU)-accelerated intelligent algorithms have been widely utilized for solving combinatorial optimization problems, which are NP-hard. These intelligent algorithms involve a common operation, namely reduction, in which the best suitable candidate solution in the neighborhood is selected. As one of the main procedures, it is necessary to optimize the reduction on the GPU. In this paper, we propose an enhanced warp-based reduction on the GPU. Compared with existing block-based reduction methods, our method efficiently exploits the potential of implementation at the warp level, which better matches the characteristics of current GPU architectures. Firstly, in order to improve global memory access performance, vectorized accesses are utilized. Secondly, at the level of thread-block reduction, an enhanced warp-based reduction on shared memory is presented to form partial results. Thirdly, for the configuration of the number of thread blocks, the number of thread blocks can be obtained from the thread-block size and the maximum number of resident threads per streaming multiprocessor on the GPU. Finally, the proposed method is evaluated on three generations of NVIDIA GPUs, showing better performance than previous methods.
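The three ingredients above (vectorized global loads, shuffle-based warp reduction, and shared memory only for cross-warp partials) combine as in the following sketch. It is a generic warp-shuffle sum assuming the input length is a multiple of four so it can be read as float4, not the paper's exact code.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Warp-level sum with shuffles: no shared memory or __syncthreads()
// is needed inside a warp.
__device__ float warpReduceSum(float v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}

// Block reduction built on warp reduction: each warp reduces its lanes,
// warp leaders stash partials in shared memory, and warp 0 folds them.
// Input is read as float4 for vectorized 128-bit global loads.
__global__ void reduceSum(const float4* in, float* out, int n4)
{
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n4;
         i += gridDim.x * blockDim.x) {
        float4 v = in[i];
        sum += v.x + v.y + v.z + v.w;
    }
    sum = warpReduceSum(sum);

    __shared__ float warpPartials[32];
    int lane = threadIdx.x & 31, warp = threadIdx.x >> 5;
    if (lane == 0) warpPartials[warp] = sum;
    __syncthreads();

    if (warp == 0) {
        sum = (lane < blockDim.x / 32) ? warpPartials[lane] : 0.0f;
        sum = warpReduceSum(sum);
        if (lane == 0) atomicAdd(out, sum);
    }
}

int main()
{
    const int n = 1 << 22;                 // multiple of 4 by construction
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));
    cudaMemset(out, 0, sizeof(float));
    reduceSum<<<256, 256>>>((const float4*)in, out, n / 4);
    float result = 0.0f;
    cudaMemcpy(&result, out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", result);
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Keeping the reduction inside the warp avoids most synchronization, which is why warp-level schemes outperform purely shared-memory block reductions on recent architectures.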
The scenario simulation analysis of water environmental emergencies is very important for risk prevention and control and for emergency response. To quickly and accurately simulate the transport and diffusion of high-intensity pollutants during sudden water pollution events, in this study a high-precision pollutant transport and diffusion model for unstructured grids based on the Compute Unified Device Architecture (CUDA) is proposed. The finite volume method with a total variation diminishing (TVD) limiter using the r-factor proposed by Kong is used to reduce numerical diffusion and oscillation errors in the simulation of pollutants under sharp concentration conditions, and graphics processing unit (GPU) acceleration technology is used to improve computational efficiency. The advection-diffusion process of the model is verified numerically using two benchmark cases, and the efficiency of the model is evaluated using an engineering example. The results demonstrate that the model performs well in simulating material transport in the presence of sharp concentration gradients, and it has high computational efficiency: the speedup is 46 times relative to the single-threaded original model. The efficiency of the accelerated model meets the requirements of engineering applications, enabling rapid early warning and assessment of water pollution accidents.
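TVD limiting is applied cell by cell when face values are reconstructed, so each cell update is independent and suits one CUDA thread per cell. The kernel below is a 1-D structured sketch with a minmod limiter and a positive advection speed assumed, a stand-in for the unstructured r-factor scheme used in the paper.

```cuda
#include <cuda_runtime.h>
#include <cmath>

// Slope limiter: zero at extrema, otherwise the smaller-magnitude slope.
__device__ float minmod(float a, float b)
{
    if (a * b <= 0.0f) return 0.0f;
    return fabsf(a) < fabsf(b) ? a : b;
}

// One thread per cell: 1-D advection with a minmod-limited (TVD) MUSCL
// reconstruction, assuming a positive advection speed u.
__global__ void advectTVD(const float* c, float* cNew, float u,
                          float dtdx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 2 || i >= n - 1) return;            // skip boundary cells
    float nu = u * dtdx;                        // Courant number
    float sL = minmod(c[i-1] - c[i-2], c[i]   - c[i-1]);
    float sC = minmod(c[i]   - c[i-1], c[i+1] - c[i]);
    float fluxL = u * (c[i-1] + 0.5f * (1.0f - nu) * sL);  // face i-1/2
    float fluxR = u * (c[i]   + 0.5f * (1.0f - nu) * sC);  // face i+1/2
    cNew[i] = c[i] - dtdx * (fluxR - fluxL);
}

int main()
{
    const int n = 1 << 16;
    float *c, *cNew;
    cudaMalloc(&c, n * sizeof(float));
    cudaMalloc(&cNew, n * sizeof(float));
    cudaMemset(c, 0, n * sizeof(float));        // real initial field here
    cudaMemset(cNew, 0, n * sizeof(float));
    advectTVD<<<(n + 255) / 256, 256>>>(c, cNew, 1.0f, 0.5f, n);
    cudaDeviceSynchronize();
    cudaFree(c); cudaFree(cNew);
    return 0;
}
```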
Organic electrochemical transistors (OECTs), essential components in bioelectronics, serve as a bridge between biological systems and electronic interfaces by converting ionic signals into electronic currents, making them crucial for applications like implantable biosensors, wearable health monitors, and neuromorphic computing architectures [1,2]. Despite their ability to interface directly with biological fluids and tissues, and their excellent conformal contact with dynamic surfaces such as human skin, OECTs still face persistent stability issues due to repeated electrochemical cycling, exposure to environmental factors, and parasitic reactions; these issues often manifest as hysteresis in device performance and continue to limit their potential for long-term bioelectronic applications [3,4]. Therefore, addressing this instability is crucial to unlocking the full potential of OECTs in chronic medical monitoring, adaptive biohybrid systems, and energy-efficient neuromorphic hardware.
The synchronization of nonlinear systems has been demonstrated in several natural systems, which not only enhances the performance of spin torque oscillators (STOs) but also enables the modification of STOs for new computing architectures. This paper reviews recent advances in the mutual synchronization, forced synchronization, and noise synchronization of STOs from both theoretical and experimental perspectives. The main types of synchronization discussed include spin wave synchronization, dipolar field synchronization, electrical connection synchronization, and injection locking. After introducing the theoretical and experimental progress in these fields, we highlight the importance of synchronization for practical applications in both microwave devices and neuromorphic computing. The significance of these studies for understanding and applying STO synchronization is emphasized, and we offer our perspective on current research, suggesting directions for future studies.
This study presents a parallel version of the string matching algorithms research tool (SMART) library, implemented on NVIDIA's compute unified device architecture (CUDA) platform, and uses general-purpose computing on graphics processing units (GPGPU) programming concepts to enhance performance and gain insight into the parallel versions of these algorithms. We have developed the CUDA-enhanced SMART (CUSMART) library, which incorporates parallelized implementations of 64 string matching algorithms, leveraging the CUDA application programming interface. The performance of these algorithms has been assessed across various scenarios to ensure a comprehensive and impartial comparison, allowing for the identification of their strengths and weaknesses in specific application contexts. We have explored and established optimization techniques to gauge their influence on the performance of these algorithms. The results of this study highlight the potential of GPGPU computing in string matching applications through the scalability of the algorithms, suggesting significant performance improvements. Furthermore, we have identified the best- and worst-performing algorithms in various scenarios.
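A common decomposition for exact string matching on the GPU assigns one thread to each candidate alignment of the pattern against the text. The brute-force kernel below illustrates that decomposition with assumed names; it is the simplest member of the family such a library parallelizes, not CUSMART's code.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

// One thread per candidate alignment: compare the pattern against the
// text starting at position i and count matches.
__global__ void naiveMatch(const char* text, int n, const char* pat, int m,
                           int* count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > n - m) return;
    for (int j = 0; j < m; ++j)
        if (text[i + j] != pat[j]) return;
    atomicAdd(count, 1);                 // match found at position i
}

int main()
{
    const char* hText = "abracadabra";
    const char* hPat  = "abra";
    int n = (int)strlen(hText), m = (int)strlen(hPat);
    char *text, *pat; int *count;
    cudaMalloc(&text, n);
    cudaMalloc(&pat, m);
    cudaMalloc(&count, sizeof(int));
    cudaMemcpy(text, hText, n, cudaMemcpyHostToDevice);
    cudaMemcpy(pat,  hPat,  m, cudaMemcpyHostToDevice);
    cudaMemset(count, 0, sizeof(int));
    naiveMatch<<<(n + 255) / 256, 256>>>(text, n, pat, m, count);
    int hCount = 0;
    cudaMemcpy(&hCount, count, sizeof(int), cudaMemcpyDeviceToHost);
    printf("matches: %d\n", hCount);     // expected: 2
    cudaFree(text); cudaFree(pat); cudaFree(count);
    return 0;
}
```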
Cloud monitoring is a source of big data that are constantly produced from traces of infrastructures, platforms, and applications. Analysis of monitoring data delivers insights into the system's workload and usage pattern and ensures workloads are operating at optimum levels. The analysis process involves data query and extraction, data analysis, and result visualization. Since the volume of monitoring data is big, these operations require a scalable and reliable architecture to extract, aggregate, and analyze data at an arbitrary range of granularity. Ultimately, the results of analysis become the knowledge of the system and should be shared and communicated. This paper presents our cloud service architecture, which explores a search cluster for data indexing and query. We develop REST APIs so that the data can be accessed by different analysis modules. This architecture enables extensions to integrate with software frameworks for both batch processing (such as Hadoop) and stream processing (such as Spark) of big data. The analysis results are structured in Semantic MediaWiki pages in the context of the monitoring data source and the analysis process. This cloud architecture is empirically assessed to evaluate its responsiveness when processing a large set of data records under node failures.
Objective The MicroTCA.4 (MTCA.4) standard systems have been widely used in large-scale scientific facilities such as synchrotron radiation light sources and FELs around the world, covering RF control, beam instrumentation, timing, machine protection, and so on. The MTCA.4 module management controller (MMC) realizes intelligent management of the boards in the chassis through a bus protocol and system interaction. It is an important functional module in the MTCA.4 standard system. Methods In order to meet the requirements of large scientific facilities, an MMC module was designed and developed. This design realizes power management of the Advanced Mezzanine Card (AMC) and Rear Transition Module (RTM) boards, as well as monitoring of the temperature, voltage, and current during operation. The core part of this module is confined to an area of 3 cm × 3 cm on the AMC board, leaving ample space for subsequent development of functional circuits. Results An AMC board was developed to verify the functions of the MMC. Test results indicate that this board is compatible with existing MTCA.4 standard systems. Conclusions This MMC solution can be directly and modularly applied to the design of MTCA.4 standard hardware.
The network switches in the data plane of Software Defined Networking (SDN) are empowered by an elementary process in which an enormous number of packets, resembling big volumes of data, are classified into specific flows by matching them against a set of dynamic rules. This basic process accelerates the processing of data, so that instead of processing singular packets repeatedly, corresponding actions are performed on corresponding flows of packets. In this paper, we first address the limitations of a typical packet classification algorithm such as Tuple Space Search (TSS). Then, we present a set of different scenarios for parallelizing it on different parallel processing platforms, including Graphics Processing Units (GPUs), clusters of Central Processing Units (CPUs), and hybrid clusters. Experimental results show that the hybrid cluster provides the best platform for parallelizing packet classification algorithms, promising an average throughput rate of 4.2 million packets per second (Mpps). That is, the hybrid cluster produced by the integration of the Compute Unified Device Architecture (CUDA), the Message Passing Interface (MPI), and the OpenMP programming model could classify 0.24 million packets per second more than the GPU cluster scheme. Such a packet classifier satisfies the required processing speed in programmable network systems that would be used to communicate big medical data.
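In a hybrid deployment the three models layer naturally: MPI distributes packet batches across nodes, OpenMP threads drive the GPUs inside a node, and CUDA kernels do the per-packet matching. The sketch below shows only the intra-node OpenMP-plus-CUDA layer, with a simple hash-equality lookup standing in for a full TSS tuple probe; every name and size is an assumption (compile with nvcc -Xcompiler -fopenmp).

```cuda
#include <cuda_runtime.h>
#include <omp.h>

// Hypothetical flow-table lookup: one thread per packet, its header hash
// matched against a dense rule table (a stand-in for a TSS tuple probe).
__global__ void classify(const unsigned* pktHash, const unsigned* ruleKey,
                         int nRules, int* action, int nPkts)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPkts) return;
    int act = -1;                         // default: no matching rule
    for (int r = 0; r < nRules; ++r)
        if (pktHash[p] == ruleKey[r]) { act = r; break; }
    action[p] = act;
}

int main()
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) return 0;
    const int nPkts = 1 << 20, nRules = 256;
    int chunk = (nPkts + ndev - 1) / ndev;

    // One OpenMP thread per GPU; in the full hybrid cluster, each MPI
    // rank would run this same loop on its own node's batch of packets.
    #pragma omp parallel num_threads(ndev)
    {
        int d = omp_get_thread_num();
        cudaSetDevice(d);
        unsigned *hash, *rules; int *act;
        cudaMalloc(&hash,  chunk * sizeof(unsigned));
        cudaMalloc(&rules, nRules * sizeof(unsigned));
        cudaMalloc(&act,   chunk * sizeof(int));
        cudaMemset(hash,  0, chunk * sizeof(unsigned));
        cudaMemset(rules, 0, nRules * sizeof(unsigned));
        classify<<<(chunk + 255) / 256, 256>>>(hash, rules, nRules,
                                               act, chunk);
        cudaDeviceSynchronize();
        cudaFree(hash); cudaFree(rules); cudaFree(act);
    }
    return 0;
}
```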