Abstract: Page fault handling is a critical technique that increases the amount of memory available to programs, but it also puts pressure on storage devices. Major page faults incur high latency and are recognized as one of the most common causes of performance problems in cluster systems. Modern cluster systems generate a large volume of system logs and resource use data, and analyzing this data is an advocated basis for failure prediction. We set up three deep learning models: the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Temporal Convolution Network (TCN). To the best of our knowledge, no prior work has predicted major page faults and page fault events in a large cluster system. In this paper, we (a) propose an approach for predicting major page faults and page fault events, and (b) evaluate the three deep learning models on real system logs and resource use data. As part of our contributions, we (a) compare the performance of the RNN, LSTM, and TCN deep learning models, (b) validate our major page fault prediction approach on two large cluster systems, and (c) provide insights into major page faults and page fault events. Our work highlights empirical observations that could facilitate prediction of major page faults and page fault events in large cluster systems.