Many scientific fields increasingly use high-performance computing (HPC) to process and analyze massive amounts of experimental data, while storage systems in today's HPC environments have to cope with new access patterns. These patterns include many metadata operations, small I/O requests, or randomized file I/O, while general-purpose parallel file systems have been optimized for sequential shared access to large files. Burst buffer file systems create a separate file system that applications can use to store temporary data. They aggregate node-local storage available within the compute nodes or use dedicated SSD clusters and offer a peak bandwidth higher than that of the backend parallel file system without interfering with it. However, burst buffer file systems typically offer many features that a scientific application, running in isolation for a limited amount of time, does not require. We present GekkoFS, a temporary, highly scalable file system which has been specifically optimized for the aforementioned use cases. GekkoFS provides relaxed POSIX semantics, offering only the features that most (not all) applications actually require. GekkoFS is, therefore, able to provide scalable I/O performance and reaches millions of metadata operations already for a small number of nodes, significantly outperforming the capabilities of common parallel file systems.
Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks, and a lot of time is spent on the competition arising at the lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrives at the coherence controller. During lock hand-off, only the requests from the winning processor contribute to the progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller keeping the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. This mechanism requires neither compiler nor programmer support, nor ISA or coherence protocol changes. By simulating a 32-processor system, we show that using request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in those having a large amount of synchronization activity (the remaining four), we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 71%, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.
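The core idea of request bypassing can be illustrated with a toy software model. The sketch below is not the hardware implementation from the paper: it ignores timing, coherence states, and how the controller recognizes the winning processor, which is simply passed in as `lock_holder` here. The function names `service_order` and `winner_wait` and the request labels are hypothetical, chosen only for illustration.

```python
def service_order(requests, lock_holder, bypass):
    """Return the order in which a toy controller services buffered requests.

    requests:    list of (processor_id, kind) tuples in arrival order.
    lock_holder: id of the processor that won the lock (assumed to be known).
    bypass:      if True, the lock holder's requests overtake buffered ones.
    """
    pending = list(requests)
    order = []
    while pending:
        idx = 0  # default: plain FIFO, service the oldest buffered request
        if bypass:
            # Prefer the oldest request issued by the lock holder, if any.
            for i, (proc, _) in enumerate(pending):
                if proc == lock_holder:
                    idx = i
                    break
        order.append(pending.pop(idx))
    return order


def winner_wait(order, lock_holder):
    """Count other processors' requests serviced before the holder's last one."""
    last = max(i for i, (proc, _) in enumerate(order) if proc == lock_holder)
    return sum(1 for proc, _ in order[:last] if proc != lock_holder)


# Processor 0 holds the lock; processors 1-3 are contending for the same line.
burst = [(1, "read"), (2, "read"), (0, "write"), (3, "read"), (0, "release")]
fifo = service_order(burst, lock_holder=0, bypass=False)
fast = service_order(burst, lock_holder=0, bypass=True)
print(winner_wait(fifo, 0), winner_wait(fast, 0))
```

In this model, the lock holder's requests sit behind three contending requests under plain buffering and behind none with bypassing, which mirrors the observation that only the winning processor's requests advance the computation during a hand-off.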
Funding (GekkoFS): This work has been funded by the German Research Foundation (DFG) through the Priority Programme 1648 "Software for Exascale Computing" and the ADA-FS project. It was also partially supported by the Spanish Ministry of Science and Innovation under Grant No. TIN2015-65316, the Generalitat de Catalunya under Contract 2014-SGR-1051, and the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No. 671951 (NEXTGenIO).
Funding (request bypassing): Supported in part by the Spanish Government and European ERDF under Grant Nos. TIN2007-66423, TIN2010-21291-C02-01 and TIN2007-60625; the gaZ:T48 research group (Aragón Government and European ESF); Consolider CSD2007-00050 (Spanish Government); and HiPEAC-2 NoE (European FP7/ICT 217068).