A personal desktop platform with teraflops peak performance and thousands of cores can be realized at the price of a conventional workstation by using programmable graphics processing units (GPUs). A GPU-based parallel Euler/Navier-Stokes solver is developed for 2-D compressible flows using NVIDIA's Compute Unified Device Architecture (CUDA) programming model in the CUDA Fortran programming language. The implementation techniques for CUDA kernels, the double-layered thread hierarchy, and the varied memory hierarchy are presented to form the GPU-based algorithm for the Euler/Navier-Stokes equations. The resulting parallel solver is validated on a set of typical test flow cases. The numerical results show that a speedup of dozens of times relative to a serial CPU implementation can be achieved on a single-GPU desktop platform, which demonstrates that a GPU desktop can serve as a cost-effective parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially.
Infrared optoelectronic sensing is the core of many critical applications such as night vision, health and medication, military, and space exploration. Adding mechanical flexibility as a new dimension enables novel features of adaptability and conformability, which is promising for developing next-generation optoelectronic sensory applications with reduced size, weight, price, and power consumption, and enhanced performance (SWaP³). However, in this emerging research frontier, challenges persist in simultaneously achieving high infrared response and good mechanical deformability in devices and integrated systems. Therefore, we present a comprehensive review of the design strategies and insights of flexible infrared optoelectronic sensors, including the fundamentals of infrared photodetectors, the selection of materials and device architectures, fabrication techniques and design strategies, and a discussion of architectural and functional integration towards applications in wearable optoelectronics and advanced image sensing. Finally, this article offers insights into future directions for practically realizing ultra-high-performance and smart sensors enabled by infrared-sensitive materials, covering challenges in materials development and device micro-/nanofabrication. Benchmarks for scaling these techniques across fabrication, performance, and integration are presented, alongside perspectives on potential applications in medication and health, biomimetic vision, and neuromorphic sensory systems.
Numerical treatment of engineering application problems often eventually requires solving systems of linear or nonlinear equations. On digital computing devices the solution process usually takes a very long time because of the extremely large system sizes encountered in most real-world engineering applications. Practical solvers for systems of linear and nonlinear equations based on multiple graphics processing units (GPUs) are therefore proposed to accelerate the solution process. In the linear and nonlinear solvers, the preconditioned bi-conjugate gradient stabilized (PBi-CGstab) method and the inexact Newton method are used to achieve fast and stable convergence. Multiple GPUs are used to provide the additional data storage that large problems need.
As an established spatial analytical tool, Geographically Weighted Regression (GWR) has been applied across a variety of disciplines. However, its usage can be challenging for large datasets, which are increasingly prevalent in today's digital world. In this study, we propose two high-performance R solutions for GWR via Multi-core Parallel (MP) and Compute Unified Device Architecture (CUDA) techniques, named GWR-MP and GWR-CUDA, respectively. We compared GWR-MP and GWR-CUDA with three existing solutions available in Geographically Weighted Models (GWmodel), Multi-scale GWR (MGWR) and Fast GWR (FastGWR). Results showed that all five solutions perform differently across varying sample sizes, with no single solution a clear winner in terms of computational efficiency. Specifically, the solutions given in GWmodel and MGWR provided acceptable computational costs for GWR studies with a relatively small sample size. For a large sample size, GWR-MP and FastGWR provided coherent solutions on a Personal Computer (PC) with a common multi-core configuration; GWR-MP provided more efficient computing capacity for each core or thread than FastGWR. For cases when the sample size was very large, and for these cases only, GWR-CUDA provided the most efficient solution, although its I/O cost with small samples should be noted. In summary, GWR-MP and GWR-CUDA provided complementary high-performance R solutions to existing ones, and they should be preferred for certain data-rich GWR studies.
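The cost that the abstract above refers to comes from calibrating one weighted regression per location. The following is a minimal sketch of that local step, not the GWR-MP or GWR-CUDA implementation: the Gaussian kernel, the fixed bandwidth, the single-predictor model, and the function names `gwr_local_fit` and `gwr` are all assumptions made for illustration.

```python
import math


def gwr_local_fit(coords, x, y, target, bandwidth):
    """Fit y ~ b0 + b1*x at `target` with Gaussian distance weights (assumed kernel)."""
    sw = swx = swy = swxx = swxy = 0.0
    for (cx, cy), xi, yi in zip(coords, x, y):
        d = math.hypot(cx - target[0], cy - target[1])
        w = math.exp(-0.5 * (d / bandwidth) ** 2)
        sw += w
        swx += w * xi
        swy += w * yi
        swxx += w * xi * xi
        swxy += w * xi * yi
    # Closed-form solution of the 2x2 weighted normal equations.
    det = sw * swxx - swx * swx
    b1 = (sw * swxy - swx * swy) / det
    b0 = (swy - b1 * swx) / sw
    return b0, b1


def gwr(coords, x, y, bandwidth):
    """One independent local fit per observation location."""
    return [gwr_local_fit(coords, x, y, c, bandwidth) for c in coords]
```

Each local fit reads all observations but writes only its own coefficients, so the outer loop in `gwr` is the natural unit to distribute across CPU cores or GPU threads.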
A graphics processing unit (GPU)-accelerated discontinuous Galerkin (DG) method is presented for solving two-dimensional laminar flows. The DG method is ported from the central processing unit (CPU) to the GPU to achieve speedup through programming under the compute unified device architecture (CUDA) model. The CUDA kernel subroutines are designed to meet the requirements of high-order computation in the DG method. The corresponding data structures are constructed in a component-wise manner, and the thread hierarchy is organized in a cell-wise or edge-wise manner according to the integrals involved in solving the laminar Navier-Stokes equations, in which the inviscid and viscous flux terms are computed by the local Lax-Friedrichs scheme and the second scheme of Bassi & Rebay, respectively. A strong stability preserving Runge-Kutta scheme is then used for time marching of the numerical solutions. The resulting GPU-accelerated DG method is first validated on the traditional Couette flow problem with different mesh sizes and different orders of approximation, which shows that the expected orders of convergence are achieved. Numerical simulations of typical flows over a circular cylinder and a NACA 0012 airfoil are then carried out, and the results are compared with analytical solutions or with available experimental and numerical values reported in the literature, together with a performance analysis of the developed code in terms of GPU speedup. The computing time of the presented test cases is significantly reduced without losing accuracy, and speedups of up to 69.7 times are achieved by the present method in comparison with its CPU counterpart.
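The abstract above does not state which strong stability preserving Runge-Kutta scheme is used. As an illustration only, the sketch below assumes the common three-stage, third-order Shu-Osher form; the names `ssp_rk3_step`, `integrate`, and `rhs` are hypothetical, and `rhs` stands in for the spatial DG residual.

```python
def ssp_rk3_step(u, dt, rhs):
    """One step of the three-stage, third-order SSP Runge-Kutta scheme (Shu-Osher form)."""
    # Stage 1: forward Euler step.
    u1 = [a + dt * b for a, b in zip(u, rhs(u))]
    # Stage 2: convex combination of u and an Euler step from u1.
    r1 = rhs(u1)
    u2 = [0.75 * a + 0.25 * (b + dt * c) for a, b, c in zip(u, u1, r1)]
    # Stage 3: convex combination of u and an Euler step from u2.
    r2 = rhs(u2)
    return [a / 3.0 + 2.0 / 3.0 * (b + dt * c) for a, b, c in zip(u, u2, r2)]


def integrate(u0, t_end, n_steps, rhs):
    """March from t = 0 to t_end with a fixed time step."""
    u = list(u0)
    dt = t_end / n_steps
    for _ in range(n_steps):
        u = ssp_rk3_step(u, dt, rhs)
    return u
```

Because every stage is a convex combination of forward Euler updates, each stage can reuse the same residual kernel, which is convenient when the residual evaluation is the part that runs on the GPU.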
The core shooting process is the most widely used technique for making sand cores, and it plays an important role in the quality of sand cores. Although numerical simulation can help optimize the core shooting process, research on numerical simulation of this process is very limited. Based on a two-fluid model (TFM) and a kinetic-friction constitutive correlation, a program for 3D numerical simulation of the core shooting process has been developed and has achieved good agreement with in-situ experiments. To match the needs of engineering applications, a graphics processing unit (GPU) has also been used to improve calculation efficiency. The parallel algorithm based on the Compute Unified Device Architecture (CUDA) platform can significantly decrease computing time through multi-threaded GPU execution. In this work, the program accelerated by the CUDA parallelization method was developed, and the accuracy of the calculations was ensured by comparison with in-situ experimental results photographed by a high-speed camera. The design and optimization of the parallel algorithm are discussed. The simulation result for a sand core test piece indicated the improvement in calculation efficiency provided by the GPU. The developed program has also been validated by in-situ experiments with a transparent core box, a high-speed camera, and a pressure measuring system. The computing time of the parallel program was reduced by nearly 95% while the simulation result remained consistent with the experimental data. The GPU parallelization method can successfully solve the problem of low computational efficiency of the 3D sand shooting simulation program, and thus the developed GPU program is appropriate for engineering applications.
Based on the finite element method (FEM) in the frequency domain and the particle-in-cell approach in the time domain, a hybrid-domain multipactor threshold prediction algorithm is proposed in this paper. The proposed algorithm combines the advantages of frequency-domain and time-domain algorithms, offering high computational accuracy together with considerable computational efficiency. In addition, the compute unified device architecture (CUDA) acceleration technique can be employed to further enhance its simulation efficiency. Numerical examples are carried out to demonstrate the effectiveness of the proposed algorithm. The results indicate that the multipactor threshold can be accurately predicted and the computational efficiency can be improved.
Large eddy simulation (LES) using the Smagorinsky eddy viscosity model is added to the two-dimensional nine-velocity (D2Q9) lattice Boltzmann equation (LBE) with multi-relaxation-time (MRT) to simulate incompressible turbulent cavity flows with Reynolds numbers up to 1 × 10^7. To improve the computational efficiency of the lattice Boltzmann method for numerical simulations of turbulent flows, the massively parallel computing power of a graphics processing unit (GPU) with the compute unified device architecture (CUDA) is introduced into the MRT-LBE-LES model. The model performs well compared with results from other studies, with a 76-fold increase in computational efficiency. It appears that the higher the Reynolds number, the smaller the Smagorinsky constant should be if the lattice number is fixed. Also, for a selected high Reynolds number and a properly selected Smagorinsky constant, there is a minimum requirement for the lattice number so that the Smagorinsky eddy viscosity will not be excessively large.
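To make the role of the Smagorinsky constant concrete, the sketch below evaluates the standard Smagorinsky eddy viscosity and the corresponding lattice Boltzmann relaxation time in lattice units. It is an illustration under stated assumptions, not the MRT-LBE-LES code: the strain rate is taken from given velocity gradients (an LBE solver would typically obtain it from non-equilibrium moments), and the function names are hypothetical.

```python
import math


def strain_rate_magnitude(dudx, dudy, dvdx, dvdy):
    """|S| = sqrt(2 * S_ij * S_ij) for a 2-D velocity gradient."""
    sxx = dudx
    syy = dvdy
    sxy = 0.5 * (dudy + dvdx)
    return math.sqrt(2.0 * (sxx * sxx + syy * syy + 2.0 * sxy * sxy))


def eddy_viscosity(cs, delta, s_mag):
    """Smagorinsky model: nu_t = (C_s * delta)^2 * |S|."""
    return (cs * delta) ** 2 * s_mag


def relaxation_time(nu0, nu_t):
    """Lattice-unit relation tau = 3 * (nu0 + nu_t) + 1/2 (unit lattice spacing and time step)."""
    return 3.0 * (nu0 + nu_t) + 0.5
```

Since the eddy viscosity grows with the square of the Smagorinsky constant, the constant directly controls how much the local relaxation time is raised above its molecular value.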
Compared with conventional X-ray absorption imaging, X-ray phase-contrast imaging shows higher contrast on samples with low attenuation coefficients, such as blood vessels and soft tissues. Among the modalities of phase-contrast imaging, grating-based phase-contrast imaging has been widely accepted because it supports a wide range of samples and does not require a coherent source. However, the downside is the substantially larger amount of data generated by the phase-stepping method, which slows down the reconstruction process. The graphics processing unit (GPU) allows parallel computing, which is very useful for processing large quantities of data. In this paper, a compute unified device architecture (CUDA) C program running on the GPU is introduced to accelerate the phase retrieval and filtered back projection (FBP) algorithms for grating-based tomography. Depending on the size of the data, the CUDA C program shows different amounts of speed-up over the standard C program on the same Visual Studio 2010 platform, and the speed-up ratio increases as the size of the data increases.
Traditional contour generation algorithms are usually implemented and optimized for common processors, and they are often inefficient when processing high-resolution images. A new graphics processing unit (GPU)-based algorithm is proposed to obtain a clear and complete contour of leaves. First, we implement the classic Sobel edge detection operator on the GPU. Then a simple and effective method is designed to remove false edges, and a heuristic algorithm is used to repair broken edges. Experiments show that the results of our algorithm are natural and realistic in terms of morphology and can serve as good material for virtual plants.
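The first step named above, the Sobel operator, can be sketched on the CPU as follows. This is a plain Python illustration rather than the GPU implementation from the paper; the function name `sobel_magnitude` and the choice to leave border pixels at zero are assumptions.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_magnitude(img):
    """Return the gradient magnitude of a 2-D grayscale image (list of rows)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = gy = 0.0
            for dr in range(3):
                for dc in range(3):
                    p = img[r + dr - 1][c + dc - 1]
                    gx += SOBEL_X[dr][dc] * p
                    gy += SOBEL_Y[dr][dc] * p
            out[r][c] = math.hypot(gx, gy)
    return out
```

Each output pixel depends only on its own 3x3 neighborhood, so the two outer loops could plausibly be replaced by one GPU thread per pixel.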
A graphics processing unit (GPU)-accelerated biological species recognition method using a partially connected neural evolutionary network model is introduced in this paper. The partially connected neural evolutionary network adopted in the paper can overcome the limitation of traditional neural networks to small inputs. The whole image is taken as the input of the neural network, so the maximum number of features can be kept for recognition. To speed up the recognition process, a fast implementation of the partially connected neural network was built on an NVIDIA Tesla C1060 using the NVIDIA compute unified device architecture (CUDA) framework. Image sets of eight biological species were obtained to test the GPU implementation and its serial CPU counterpart. Experimental results showed that the GPU implementation works effectively in terms of both recognition rate and speed, and gained a 343-fold speedup over its CPU counterpart. Compared with a feature-based recognition method on the same recognition task, the method also achieved an acceptable correct rate of 84.6% when tested on the eight biological species.
Photodetectors are the fundamental building blocks for many optoelectronic systems, including night vision, optical communications, biomedical imaging, security and motion detection. Carbon nanotubes (CNTs), which have a direct-bandgap structure, a broad spectral response and a large absorption coefficient, provide an ideal research platform for the exploration of high-performance infrared photodetectors. In the past twenty years, great efforts have been devoted to improving detection sensitivity by adopting high-purity CNT films, various doping strategies, optical manipulations and sensitizing nanostructures. Despite the considerable strides made, challenges remain in simultaneously achieving high responsivity, low dark current and fast response. In this Review, we summarize recent advances in key device construction strategies and the underlying concepts that contribute to improving the performance of fabricated CNT photodetectors. The newly emerging heterojunction-gated CNT transistors and their potential to overcome trade-offs between the optical and electronic processes are highlighted. Novel applications of CNT photodetectors are further summarized for advanced optoelectronic technologies.
In recent years, graphics processing unit (GPU)-accelerated intelligent algorithms have been widely utilized for solving combinatorial optimization problems, which are NP-hard. These intelligent algorithms involve a common operation, namely reduction, in which the most suitable candidate solution in the neighborhood is selected. Because reduction is one of the main procedures, it is necessary to optimize it on the GPU. In this paper, we propose an enhanced warp-based reduction on the GPU. Compared with existing block-based reduction methods, our method efficiently exploits the potential of implementation at the warp level, which better matches the characteristics of current GPU architectures. First, vectorized access is used to improve global memory access performance. Second, at the level of thread-block reduction, an enhanced warp-based reduction in shared memory is presented to form partial results. Third, the number of thread blocks is obtained by maximizing the thread-block size and the maximum number of threads per streaming multiprocessor on the GPU. Finally, the proposed method is evaluated on three generations of NVIDIA GPUs and shows better performance than previous methods.
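The data flow of a warp-level reduction can be illustrated with a CPU emulation. The sketch below mimics a shuffle-down style tree reduction over one 32-lane warp and then combines the per-warp partial results; it is a sequential Python model for explanation only, not the enhanced method of the paper, and the names `warp_reduce_min` and `block_reduce_min` are hypothetical.

```python
WARP_SIZE = 32


def warp_reduce_min(lanes):
    """Emulate a shuffle-down tree reduction (minimum) across one 32-lane warp."""
    vals = list(lanes)
    if len(vals) != WARP_SIZE:
        raise ValueError("a warp holds exactly 32 lanes")
    offset = WARP_SIZE // 2
    while offset > 0:
        # Snapshot so every lane reads its partner's value from before this step.
        prev = vals[:]
        for lane in range(WARP_SIZE):
            partner = lane + offset
            if partner < WARP_SIZE:
                vals[lane] = min(prev[lane], prev[partner])
        offset //= 2
    return vals[0]  # lane 0 holds the warp-wide minimum


def block_reduce_min(values):
    """Reduce per warp first, then reduce the per-warp partial results."""
    partials = list(values)
    if not partials:
        raise ValueError("cannot reduce an empty sequence")
    while True:
        # Pad with +inf so the last warp is full; +inf never wins a minimum.
        partials += [float("inf")] * (-len(partials) % WARP_SIZE)
        partials = [
            warp_reduce_min(partials[i:i + WARP_SIZE])
            for i in range(0, len(partials), WARP_SIZE)
        ]
        if len(partials) == 1:
            return partials[0]
```

Five halving steps suffice for 32 lanes, and on real hardware the lanes of a warp execute those steps together, which is why warp-level schemes can avoid much of the synchronization that block-level reductions need.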
Organic thermoelectric materials have emerged as compelling candidates for harvesting low-grade heat in flexible and lightweight energy systems. Compared to conventional inorganic thermoelectric materials, organic thermoelectric materials offer distinct advantages, including intrinsically low thermal conductivity, mechanical flexibility, and compatibility with large-area and solution-based processing. While p-type materials such as poly(3,4-ethylenedioxythiophene):polystyrene sulfonate (PEDOT:PSS) have been extensively optimized through solvent treatments and de-doping strategies, recent advances in air-stable n-type polymers such as poly(benzodifurandione) (PBFDO) have greatly narrowed the performance gap and made it feasible to construct fully organic thermoelectric modules. This review highlights recent progress in organic thermoelectric materials with a focus on molecular design, doping mechanisms, and device-level integration. We examine how novel polymers, dopant formulations, and emerging concepts have been driving improvements in the performance of organic thermoelectric materials toward practical application. Our group's previous contributions to module design, such as thermal lamination techniques and integrated circuits, are presented as case studies of system-level implementation. Despite their relatively modest power factors and thermoelectric figures of merit, organic thermoelectric materials possess unique advantages in terms of low weight, processability, and scalability that make them especially suited for gram-scale modules and for powering small-scale electronic devices and Internet-of-Things systems using ambient thermal energy.
Flexible devices, such as flexible electronic devices and flexible energy storage devices, have attracted a significant amount of attention in recent years for their potential applications in modern life. The development of flexible devices is moving forward rapidly, as innovations in methods and manufacturing processes have greatly encouraged research in this area. This review focuses on the advanced materials, architecture designs and wide-ranging applications of flexible devices, and discusses the current problems and challenges they face. We summarize the discovery of novel materials and the design of new architectures for improving the performance of flexible devices. Finally, we introduce the applications of flexible devices as key components in real life.
In this work, the photovoltaic properties of PBDB-T:ITIC-based NF-PSCs were fully optimized and characterized by tuning the morphology of the active layers and changing the device architecture. First, donor/acceptor (D/A) weight ratios were scanned, and then further optimization was performed at the best D/A ratio (1:1, w/w) using different additives, i.e., 1,8-diiodooctane (DIO), diphenyl ether (DPE), 1-chloronaphthalene (CN) and N-methyl-2-pyrrolidone (NMP). Finally, conventional and inverted device architectures with different buffer layers were employed to fabricate NF-PSC devices, and the morphology of the active layers was further optimized by controlling the annealing temperature and time. As a result, a record efficiency of 11.3% was achieved, which is the highest result reported for NF-PSCs. It is also remarkable that the inverted NF-PSCs exhibited long-term stability, i.e., the best-performing devices maintained 83% of their initial PCEs after over 4000 h of storage.
To achieve fabrication and cost competitiveness in organic optoelectronic devices, including organic solar cells (OSCs) and organic light-emitting diodes (OLEDs), it is desirable to have one type of material that can function as both the electron and hole transport layers (ETLs and HTLs) in all device architectures (i.e., normal and inverted architectures). We address this issue by proposing and demonstrating Cs-intercalated metal oxides (with various Cs mole ratios) as both the ETL and HTL of an organic optoelectronic device with normal and inverted device architectures. Our results demonstrate that the new approach works well for the widely used transition metal oxides molybdenum oxide (MoOₓ) and vanadium oxide (V₂Oₓ). Moreover, the Cs-intercalated metal-oxide-based ETL and HTL can be easily formed by a room-temperature, water-free and solution-based process. These conditions favor practical applications of OSCs and OLEDs. Notably, analyses with a Kelvin probe system show that Cs-intercalated metal oxides with a wide mole ratio range of transition metal (Mo or V) to Cs, from 1:0 to 1:0.75, offer significant and continuous work function tuning of as much as 1.31 eV for functioning as both an ETL and HTL. Consequently, our method of intercalated metal oxides can contribute to emerging large-scale and low-cost organic optoelectronic devices.
This study presents a parallel version of the string matching algorithms research tool (SMART) library, implemented on NVIDIA's compute unified device architecture (CUDA) platform, and uses general-purpose computing on graphics processing units (GPGPU) programming concepts to enhance performance and gain insight into the parallel versions of these algorithms. We have developed the CUDA-enhanced SMART (CUSMART) library, which incorporates parallelized versions of 64 string matching algorithms, leveraging the CUDA application programming interface. The performance of these algorithms has been assessed across various scenarios to ensure a comprehensive and impartial comparison, allowing for the identification of their strengths and weaknesses in specific application contexts. We have explored and established optimization techniques to gauge their influence on the performance of these algorithms. The results of this study highlight the potential of GPGPU computing in string matching applications through the scalability of the algorithms, suggesting significant performance improvements. Furthermore, we have identified the best and worst performing algorithms in various scenarios.
The continuous downsizing of devices has sustained Moore's law for the past 40 years. As power dissipation has become more and more serious, many emerging technologies have been adopted in the past decade to solve the problems of short-channel effects, leakage and performance degradation. In this paper, emerging scaling technologies and device innovations, including high-k/metal gate, strain, ultra-shallow junction, tri-gate FinFET, extremely thin SOI and silicon nanowire FET, are reviewed and discussed in terms of their potential and challenges for the post-Moore era.
Bundle adjustment (BA) is a crucial but time-consuming step in 3D reconstruction. In this paper, we tackle a special class of BA problems in which the reconstructed 3D points are much more numerous than the camera parameters, called Massive-Points BA (MPBA) problems. This is often the case when high-resolution images are used. We present the design and implementation of a new bundle adjustment algorithm for efficiently solving MPBA problems. The use of hardware parallelism, on multi-core CPUs as well as GPUs, is explored. Through careful memory-usage design, the graphics-memory limitation is effectively alleviated. Several modern acceleration strategies for bundle adjustment, such as mixed-precision arithmetic, embedded point iteration, and preconditioned conjugate gradients, are explored and compared. Using several high-resolution image datasets, we generate a variety of MPBA problems, on which the performance of five bundle adjustment algorithms is evaluated. The experimental results show that our algorithm is up to 40 times faster than classical Sparse Bundle Adjustment while maintaining comparable precision.
Funding (GPU-based Euler/Navier-Stokes solver): supported by the National Natural Science Foundation of China (No. 11172134) and the Funding of Jiangsu Innovation Program for Graduate Education (No. CXLX13_132).
Funding (flexible infrared optoelectronic sensors review): support from the National Natural Science Foundation of China (62204015) and the Beijing Natural Science Foundation (L223006).
Funding (high-performance GWR solutions): supported by the National Key Research and Development Program of China [grant number 2021YFB3900904] and the National Natural Science Foundation of China [grant numbers 42071368, U2033216, 41871287].
Funding: Partially supported by the National Natural Science Foundation of China (No. 11972189), the Natural Science Foundation of Jiangsu Province (No. BK20190391), the Natural Science Foundation of Anhui Province (No. 1908085QF260), and the Priority Academic Program Development of Jiangsu Higher Education Institutions.
Abstract: A graphics processing unit (GPU)-accelerated discontinuous Galerkin (DG) method is presented for solving two-dimensional laminar flows. The DG method is ported from the central processing unit (CPU) to the GPU, and the speedup is obtained by programming under the compute unified device architecture (CUDA) model. The CUDA kernel subroutines are designed to meet the requirements of the high-order computations of the DG method. The corresponding data structures are constructed in a component-wise manner, and the thread hierarchy is organized in a cell-wise or edge-wise manner according to the integrals involved in solving the laminar Navier-Stokes equations, in which the inviscid and viscous flux terms are computed by the local Lax-Friedrichs scheme and the second scheme of Bassi & Rebay, respectively. A strong stability preserving Runge-Kutta scheme is then used for time marching of the numerical solutions. The resulting GPU-accelerated DG method is first validated on the classical Couette flow problem with different mesh sizes and different orders of approximation, which shows that the expected orders of convergence are achieved. Numerical simulations of typical flows over a circular cylinder and a NACA 0012 airfoil are then carried out, and the results are compared with analytical solutions or with experimental and numerical values reported in the literature; the performance of the developed code is also analyzed in terms of GPU speedup. The computing time for the presented test cases is significantly reduced without loss of accuracy, and speedups of up to 69.7 times over the CPU counterpart are achieved.
Funding: Supported by the National Natural Science Foundation of China (51575304).
Abstract: The core shooting process is the most widely used technique for making sand cores, and it plays an important role in their quality. Although numerical simulation has the potential to optimize the core shooting process, research on numerical simulation of this process is very limited. Based on a two-fluid model (TFM) and a kinetic-friction constitutive correlation, a program for 3D numerical simulation of the core shooting process has been developed and has achieved good agreement with in-situ experiments. To meet the needs of engineering applications, a graphics processing unit (GPU) has also been used to improve the calculation efficiency. A parallel algorithm based on the Compute Unified Device Architecture (CUDA) platform can significantly decrease computing time through multi-threaded execution on the GPU. In this work, a program accelerated by the CUDA parallelization method was developed, and the accuracy of the calculations was confirmed by comparison with in-situ experimental results photographed by a high-speed camera. The design and optimization of the parallel algorithm are discussed. The simulation of a sand core test-piece demonstrated the improvement in calculation efficiency provided by the GPU. The developed program was also validated by in-situ experiments with a transparent core-box, a high-speed camera, and a pressure measuring system. The computing time of the parallel program was reduced by nearly 95% while the simulation result remained consistent with experimental data. The GPU parallelization method can successfully solve the problem of low computational efficiency of the 3D sand shooting simulation program, and thus the developed GPU program is appropriate for engineering applications.
Funding: This work was supported by the National Natural Science Foundation of China (61571022, 61971022) and the National Key Laboratory Foundation (HTKJ2019KI504013, 61424020305).
Abstract: Based on the finite element method (FEM) in the frequency domain and the particle-in-cell approach in the time domain, a hybrid-domain multipactor threshold prediction algorithm is proposed in this paper. The proposed algorithm combines the advantages of frequency-domain and time-domain algorithms, offering both high computational accuracy and considerable computational efficiency. In addition, the compute unified device architecture (CUDA) acceleration technique can be employed to further enhance its simulation efficiency. Numerical examples are carried out to demonstrate the effectiveness of the proposed algorithm. The results indicate that the multipactor threshold can be accurately predicted and that the computational efficiency can be improved.
Funding: Supported by the College of William and Mary, Virginia Institute of Marine Science, for the study environment.
Abstract: Large eddy simulation (LES) using the Smagorinsky eddy viscosity model is added to the two-dimensional nine-velocity (D2Q9) lattice Boltzmann equation (LBE) with multi-relaxation-time (MRT) to simulate incompressible turbulent cavity flows with Reynolds numbers up to 1 × 10^7. To improve the computational efficiency of the lattice Boltzmann method (LBM) in numerical simulations of turbulent flows, the massively parallel computing power of a graphics processing unit (GPU) with the compute unified device architecture (CUDA) is introduced into the MRT-LBE-LES model. The model performs well compared with the results from others, with a 76-fold increase in computational efficiency. It appears that, for a fixed lattice number, the higher the Reynolds number is, the smaller the Smagorinsky constant should be. Also, for a selected high Reynolds number and a properly selected Smagorinsky constant, there is a minimum requirement for the lattice number so that the Smagorinsky eddy viscosity will not be excessively large.
Funding: Supported by the National Basic Research Program (973) of China (No. 2010CB834300) and the Biomedical Engineering Cross-Research Fund of Shanghai Jiao Tong University (Nos. YG2011MS49 and YG2013MS65).
Abstract: Compared with conventional X-ray absorption imaging, X-ray phase-contrast imaging shows higher contrast on samples with low attenuation coefficients, such as blood vessels and soft tissues. Among the modalities of phase-contrast imaging, grating-based phase-contrast imaging has been widely accepted because it supports a wide range of samples and does not require a coherent source. Its downside is the substantially larger amount of data generated by the phase-stepping method, which slows down the reconstruction process. The graphics processing unit (GPU) allows parallel computing, which is very useful for processing large quantities of data. In this paper, a compute unified device architecture (CUDA) C program running on the GPU is introduced to accelerate the phase retrieval and filtered back projection (FBP) algorithms for grating-based tomography. Depending on the size of the data, the CUDA C program shows different amounts of speed-up over the standard C program on the same Visual Studio 2010 platform, and the speed-up ratio increases as the size of the data increases.
Funding: Project supported by the Shanghai Leading Academic Discipline Project (Grant No. J50103) and the National Natural Science Foundation of China (Grant No. 60970150).
Abstract: Traditional contour generation algorithms are usually implemented and optimized for general-purpose processors, and they are often inefficient when processing high-resolution images. A new graphics processing unit (GPU)-based algorithm is proposed to obtain clear and complete contours of leaves. First, we implement the classic Sobel edge detection operator on the GPU. Then a simple and effective method is designed to remove false edges, and a heuristic algorithm is used to repair broken edges. Experiments show that the results of our algorithm are morphologically natural and realistic and can serve as good material for virtual plants.
Funding: National Natural Science Foundation of China (No. 60975084); Natural Science Foundation of Fujian Province, China (No. 2011J05159).
Abstract: A graphics processing unit (GPU)-accelerated biological species recognition method using a partially connected neural evolutionary network model is introduced in this paper. The partially connected neural evolutionary network adopted in the paper can overcome the disadvantage of traditional neural networks, which are limited to small inputs. The whole image is used as the input of the neural network, so the maximum number of features can be kept for recognition. To speed up the recognition process, a fast implementation of the partially connected neural network was built on an NVIDIA Tesla C1060 using the NVIDIA compute unified device architecture (CUDA) framework. Image sets of eight biological species were obtained to test the GPU implementation and its serial CPU counterpart. The experimental results showed that the GPU implementation works effectively in terms of both recognition rate and speed, and gained a speedup of 343 over the CPU implementation. Compared with a feature-based recognition method on the same recognition task, the method also achieved an acceptable correct rate of 84.6% when tested on the eight biological species.
Funding: Supported by the National Science Foundation of China (U21A6004, 62225101 and 62101008).
Abstract: Photodetectors are the fundamental building blocks of many optoelectronic systems, including night vision, optical communications, biomedical imaging, security and motion detection. Carbon nanotubes (CNTs), which have a direct-bandgap structure, a broad spectral response and a large absorption coefficient, provide an ideal research platform for the exploration of high-performance infrared photodetectors. In the past twenty years, great efforts have been devoted to improving detection sensitivity by adopting high-purity CNT films, various doping strategies, optical manipulations and sensitizing nanostructures. Despite the considerable strides made, challenges remain in simultaneously achieving high responsivity, low dark current and fast response. In this Review, we summarize recent advances in key device construction strategies and the underlying concepts that contribute to improving the performance of fabricated CNT photodetectors. The newly emerging heterojunction-gated CNT transistors and their potential to overcome trade-offs between the optical and electronic processes are highlighted. Novel applications of CNT photodetectors in advanced optoelectronic technologies are further summarized.
Funding: Supported by the National Natural Science Foundation of China (61472289) and the Natural Science Foundation of Hubei Province (2015CFB254).
Abstract: In recent years, graphics processing unit (GPU)-accelerated intelligent algorithms have been widely used for solving combinatorial optimization problems, which are NP-hard. These intelligent algorithms involve a common operation, namely reduction, in which the most suitable candidate solution in the neighborhood is selected. Because reduction is one of the main procedures, it is necessary to optimize it on the GPU. In this paper, we propose an enhanced warp-based reduction on the GPU. Compared with existing block-based reduction methods, our method efficiently exploits the potential of a warp-level implementation, which better matches the characteristics of current GPU architectures. First, vectorized access is used to improve global memory access performance. Second, at the level of thread block reduction, an enhanced warp-based reduction in shared memory is presented to form partial results. Third, the number of thread blocks is configured by maximizing the thread block size and the maximum number of threads per streaming multiprocessor on the GPU. Finally, the proposed method is evaluated on three generations of NVIDIA GPUs and shows better performance than previous methods.
Funding: Japan Science and Technology Agency, Grant/Award Number: JPMJTR23R6.
Abstract: Organic thermoelectric materials have emerged as compelling candidates for harvesting low‐grade heat in flexible and lightweight energy systems. Compared to conventional inorganic thermoelectric materials, organic thermoelectric materials offer distinct advantages, including intrinsically low thermal conductivity, mechanical flexibility, and compatibility with large‐area and solution‐based processing. While p‐type materials such as poly(3,4‐ethylenedioxythiophene):polystyrene sulfonate (PEDOT:PSS) have been extensively optimized through solvent treatments and de‐doping strategies, recent advances in air‐stable n‐type polymers such as poly(benzodifurandione) (PBFDO) have greatly narrowed the performance gap and made it feasible to construct fully organic thermoelectric modules. This review highlights recent progress in organic thermoelectric materials with a focus on molecular design, doping mechanisms, and device‐level integration. We examine how novel polymers, dopant formulations, and emerging concepts have been driving improvements in the performance of organic thermoelectric materials toward practical application. Our group's previous contributions to module design, such as thermal lamination techniques and integrated circuits, are presented as case studies of system‐level implementation. Despite their relatively modest power factors and thermoelectric figures of merit, organic thermoelectric materials possess unique advantages in terms of low weight, processability, and scalability that make them especially suited for gram‐scale modules and for powering small‐scale electronic devices and Internet‐of‐Things systems using ambient thermal energy.
Funding: Supported by the National Key R&D Program of China (Nos. 2017YFA0208200, 2016YFB0700600, 2015CB659300), the National Natural Science Foundation of China (Nos. 21403105, 21573108), and the Fundamental Research Funds for the Central Universities (No. 020514380107).
Abstract: Flexible devices, such as flexible electronic devices and flexible energy storage devices, have attracted significant attention in recent years for their potential applications in modern life. The development of flexible devices is moving forward rapidly, as innovations in methods and manufacturing processes have greatly encouraged research in this area. This review focuses on advanced materials, architecture designs and the many applications of flexible devices, and discusses the current problems and challenges facing them. We summarize the discovery of novel materials and the design of new architectures for improving the performance of flexible devices. Finally, we introduce the applications of flexible devices as key components in real life.
Funding: Supported by the National Basic Research Program (2014CB643501), the National Natural Science Foundation of China (91333204, 21325419), and the Chinese Academy of Sciences (XDB12030200).
Abstract: In this work, the photovoltaic properties of PBDB-T:ITIC-based non-fullerene polymer solar cells (NF-PSCs) were fully optimized and characterized by tuning the morphology of the active layers and changing the device architecture. First, the donor/acceptor (D/A) weight ratios were scanned, and then further optimization was performed at the best D/A ratio (1:1, w/w) by using different additives, i.e. 1,8-diiodooctane (DIO), diphenyl ether (DPE), 1-chloronaphthalene (CN) and N-methyl-2-pyrrolidone (NMP). Finally, conventional or inverted device architectures with different buffer layers were employed to fabricate NF-PSC devices, and the morphology of the active layers was further optimized by controlling the annealing temperature and time. As a result, a record efficiency of 11.3% was achieved, which is the highest result for NF-PSCs. It is also notable that the inverted NF-PSCs exhibited long-term stability, i.e. the best-performing devices maintained 83% of their initial power conversion efficiencies (PCEs) after over 4000 h of storage.
Funding: This study was supported by the University Grant Council of the University of Hong Kong (Grant Nos. 10401466 and 201111159062), the General Research Fund (Grant Nos. HKU711813 and HKU711612E), an RGC-NSFC grant (N_HKU709/12), and grant CAS14601 from the CAS-Croucher Funding Scheme for Joint Laboratories.
Abstract: To achieve fabrication and cost competitiveness in organic optoelectronic devices, including organic solar cells (OSCs) and organic light-emitting diodes (OLEDs), it is desirable to have one type of material that can simultaneously function as both the electron and hole transport layers (ETLs and HTLs) of the organic devices in all device architectures (i.e., normal and inverted architectures). We address this issue by proposing and demonstrating Cs-intercalated metal oxides (with various Cs mole ratios) as both the ETL and HTL of an organic optoelectronic device with normal and inverted device architectures. Our results demonstrate that the new approach works well for the widely used transition metal oxides molybdenum oxide (MoOx) and vanadium oxide (V2Ox). Moreover, the Cs-intercalated metal-oxide-based ETL and HTL can be easily formed by a room-temperature, water-free, solution-based process. These conditions favor practical applications of OSCs and OLEDs. Notably, Kelvin probe analyses show that Cs-intercalated metal oxides with transition metal (Mo or V)/Cs mole ratios ranging widely from 1:0 to 1:0.75 offer significant and continuous work function tuning of as much as 1.31 eV, allowing them to function as both an ETL and an HTL. Consequently, our method of intercalated metal oxides can contribute to emerging large-scale and low-cost organic optoelectronic devices.
Funding: Project supported by the Scientific and Technological Research Council of Türkiye (No. 117E142). Open access funding provided by the Scientific and Technological Research Council of Türkiye (TÜBİTAK).
Abstract: This study presents a parallel version of the string matching algorithms research tool (SMART) library, implemented on NVIDIA’s compute unified device architecture (CUDA) platform, and uses general-purpose computing on graphics processing unit (GPGPU) programming concepts to enhance performance and gain insight into the parallel versions of these algorithms. We have developed the CUDA-enhanced SMART (CUSMART) library, which incorporates parallelized versions of 64 string matching algorithms, leveraging the CUDA application programming interface. The performance of these algorithms has been assessed across various scenarios to ensure a comprehensive and impartial comparison, allowing their strengths and weaknesses in specific application contexts to be identified. We have explored optimization techniques and gauged their influence on the performance of these algorithms. The results of this study highlight the potential of GPGPU computing in string matching applications through the scalability of the algorithms, suggesting significant performance improvements. Furthermore, we have identified the best- and worst-performing algorithms in various scenarios.
Abstract: The continuous downsizing of devices has sustained Moore's law for the past 40 years. As power dissipation has become more and more serious, many emerging technologies have been adopted in the past decade to address the short-channel effect, leakage and performance degradation problems. In this paper, the emerging scaling technologies and device innovations, including high-k/metal gate, strain, ultra-shallow junction, tri-gate FinFET, extremely thin SOI and silicon nanowire FET, are reviewed and discussed in terms of their potential and challenges for the post-Moore era.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 60835003 and the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA06030300.
Abstract: Bundle adjustment (BA) is a crucial but time-consuming step in 3D reconstruction. In this paper, we tackle a special class of BA problems in which the reconstructed 3D points are much more numerous than the camera parameters, called Massive-Points BA (MPBA) problems. This is often the case when high-resolution images are used. We present the design and implementation of a new bundle adjustment algorithm for efficiently solving MPBA problems. The use of hardware parallelism, on multi-core CPUs as well as GPUs, is explored. Through careful memory-usage design, the graphics-memory limitation is effectively alleviated. Several modern acceleration strategies for bundle adjustment, such as mixed-precision arithmetic, embedded point iteration, and preconditioned conjugate gradients, are explored and compared. Using several high-resolution image datasets, we generate a variety of MPBA problems, on which the performance of five bundle adjustment algorithms is evaluated. The experimental results show that our algorithm is up to 40 times faster than classical Sparse Bundle Adjustment while maintaining comparable precision.