Treffer: Leveraging Deep Learning for Fault Detection and Localization in Distributed Systems
Weitere Informationen
The dynamic and complex nature of distributed systems makes fault localization extremely difficult, frequently leading to extended outages and higher operating expenses. A deep learning-based fault localization framework that combines Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Convolutional Neural Network (CNN), LSTM+CNN, and Autoencoder+LSTM models is proposed in this study. These models undergo extensive preprocessing, including log parsing, feature extraction using TF-IDF and Word2Vec, and min-max normalisation, before being trained and assessed on five benchmark datasets: HDFS, OpenStack, Spark, Hadoop, and BGL. To ensure robustness, the methodology incorporates a 5-fold cross-validation strategy, model-specific architecture tuning, and 1-D sequence modelling. According to experimental results, CNN performs best overall on the HDFS dataset, with an Mean Squared Error (MSE) of 0.00002 and an Coefficient of Determination (R2 Score) Score of 0.996. CNN continuously beats other models in accuracy and performance across all datasets. The key contributions of this study are: 1) a thorough fault localization framework built with deep learning for distributed systems; 2) a comparison of five cutting-edge architectures on five real-world datasets; and 3) statistically validated performance benchmarks backed by Wilcoxon signed-rank tests and t-tests. These contributions provide useful information for implementing accurate and scalable fault localization in distributed computing environments found in the real world.