Result: Leveraging Deep Learning for Fault Detection and Localization in Distributed Systems

Title:
Leveraging Deep Learning for Fault Detection and Localization in Distributed Systems
Source:
IEEE Access, Vol 13, Pp 120069-120084 (2025)
Publisher Information:
IEEE, 2025.
Publication Year:
2025
Collection:
LCC:Electrical engineering. Electronics. Nuclear engineering
Document Type:
Journal article
File Description:
electronic resource
Language:
English
ISSN:
2169-3536
DOI:
10.1109/ACCESS.2025.3587529
Accession Number:
edsdoj.6fb17ce72fc04c718acccbe129bbcb56
Database:
Directory of Open Access Journals

More Information

The dynamic and complex nature of distributed systems makes fault localization extremely difficult, frequently leading to extended outages and higher operating expenses. This study proposes a deep learning-based fault localization framework that combines Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Convolutional Neural Network (CNN), LSTM+CNN, and Autoencoder+LSTM models. The log data undergo extensive preprocessing, including log parsing, feature extraction using TF-IDF and Word2Vec, and min-max normalisation, before the models are trained and assessed on five benchmark datasets: HDFS, OpenStack, Spark, Hadoop, and BGL. To ensure robustness, the methodology incorporates a 5-fold cross-validation strategy, model-specific architecture tuning, and 1-D sequence modelling. According to the experimental results, CNN performs best overall on the HDFS dataset, with a Mean Squared Error (MSE) of 0.00002 and a Coefficient of Determination (R2 score) of 0.996. CNN consistently outperforms the other models in accuracy and performance across all datasets. The key contributions of this study are: 1) a thorough deep learning-based fault localization framework for distributed systems; 2) a comparison of five cutting-edge architectures on five real-world datasets; and 3) statistically validated performance benchmarks backed by Wilcoxon signed-rank tests and t-tests. These contributions provide useful guidance for implementing accurate and scalable fault localization in real-world distributed computing environments.
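
As a reading aid, the sketch below illustrates three ingredients the abstract names: min-max normalisation, a 5-fold split, and the MSE and R2 metrics. It uses the standard textbook definitions and is not the authors' code; the function names (min_max_normalise, k_fold_indices, mse, r2_score) and the contiguous, unshuffled fold layout are assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple


def min_max_normalise(values: Sequence[float]) -> List[float]:
    """Rescale values linearly into [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs over contiguous, near-equal folds.

    Assumption: no shuffling; the abstract does not say how folds were drawn.
    """
    # The first (n_samples % k) folds get one extra sample each.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        splits.append((train, test))
        start = stop
    return splits


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined when y_true is constant (SS_tot would be zero).
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Under these definitions, an MSE close to 0 together with an R2 close to 1 (such as the values reported for CNN on HDFS) means the predictions track the targets almost exactly, whereas a model that always predicts the mean of the targets scores an R2 of 0.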