3D reverse time migration in tiled transversly isotropic(3D RTM-TTI) is the most precise model for complex seismic imaging.However,vast computing time of 3D RTM-TTI prevents it from being widely used,which is addresse...3D reverse time migration in tiled transversly isotropic(3D RTM-TTI) is the most precise model for complex seismic imaging.However,vast computing time of 3D RTM-TTI prevents it from being widely used,which is addressed by providing parallel solutions for 3D RTM-TTI on multicores and many-cores.After data parallelism and memory optimization,the hot spot function of 3D RTMTTI gains 35.99 X speedup on two Intel Xeon CPUs,89.75 X speedup on one Intel Xeon Phi,89.92 X speedup on one NVIDIA K20 GPU compared with serial CPU baseline.This study makes RTM-TTI practical in industry.Since the computation pattern in RTM is stencil,the approaches also benefit a wide range of stencil-based applications.展开更多
The Distributed Shared Memory(DSM)architecture is widely used in today’s computer design to mitigate the ever-widening processing-memory gap,and it inevitably exhibits Non-Uniform Memory Access(NUMA)to shared-memory ...The Distributed Shared Memory(DSM)architecture is widely used in today’s computer design to mitigate the ever-widening processing-memory gap,and it inevitably exhibits Non-Uniform Memory Access(NUMA)to shared-memory parallel applications.Failure to adapt to the NUMA effect can significantly downgrade application performance,especially on today’s manycore platforms with tens to hundreds of cores.However,traditional approaches such as first-touch and memory policy fall short in false page-sharing,fragmentation,or ease of use.In this paper,we propose a partitioned shared-memory approach that allows multithreaded applications to achieve full NUMA-awareness with only minor code changes and develop an accompanying NUMA-aware heap manager which eliminates false page-sharing and minimizes fragmentation.Experiments on a 256-core cc-NUMA computing node show that the proposed approach helps applications to adapt to NUMA with only minor code changes and improves the performance of typical multithreaded scientific applications by up to 4.3 folds with the increased use of cores.展开更多
基金Supported by the National Natural Science Foundation of China(No.61432018)
文摘3D reverse time migration in tiled transversly isotropic(3D RTM-TTI) is the most precise model for complex seismic imaging.However,vast computing time of 3D RTM-TTI prevents it from being widely used,which is addressed by providing parallel solutions for 3D RTM-TTI on multicores and many-cores.After data parallelism and memory optimization,the hot spot function of 3D RTMTTI gains 35.99 X speedup on two Intel Xeon CPUs,89.75 X speedup on one Intel Xeon Phi,89.92 X speedup on one NVIDIA K20 GPU compared with serial CPU baseline.This study makes RTM-TTI practical in industry.Since the computation pattern in RTM is stencil,the approaches also benefit a wide range of stencil-based applications.
基金supported by the National Key Research and Development Program of China(No.2016YFB0201300)。
文摘The Distributed Shared Memory(DSM)architecture is widely used in today’s computer design to mitigate the ever-widening processing-memory gap,and it inevitably exhibits Non-Uniform Memory Access(NUMA)to shared-memory parallel applications.Failure to adapt to the NUMA effect can significantly downgrade application performance,especially on today’s manycore platforms with tens to hundreds of cores.However,traditional approaches such as first-touch and memory policy fall short in false page-sharing,fragmentation,or ease of use.In this paper,we propose a partitioned shared-memory approach that allows multithreaded applications to achieve full NUMA-awareness with only minor code changes and develop an accompanying NUMA-aware heap manager which eliminates false page-sharing and minimizes fragmentation.Experiments on a 256-core cc-NUMA computing node show that the proposed approach helps applications to adapt to NUMA with only minor code changes and improves the performance of typical multithreaded scientific applications by up to 4.3 folds with the increased use of cores.