Result: Leveraging Deep Learning for Fault Detection and Localization in Distributed Systems

Title:

Leveraging Deep Learning for Fault Detection and Localization in Distributed Systems

Authors:

Debolina Ghosh, Jay Prakash Singh

Source:

IEEE Access, Vol 13, Pp 120069-120084 (2025)

Publisher Information:

IEEE, 2025.

Publication Year:

2025

Collection:

LCC: Electrical engineering. Electronics. Nuclear engineering

Subject Terms:

Fault localization, deep learning in distributed systems, log-based failure detection, CNN for fault diagnosis, sustainable computing, Electrical engineering. Electronics. Nuclear engineering, TK1-9971

Document Type:

Journal article

File Description:

electronic resource

Language:

English

ISSN:

2169-3536

Relation:

https://ieeexplore.ieee.org/document/11075581/; https://doaj.org/toc/2169-3536

DOI:

10.1109/ACCESS.2025.3587529

Access URL:

https://doaj.org/article/6fb17ce72fc04c718acccbe129bbcb56

Accession Number:

edsdoj.6fb17ce72fc04c718acccbe129bbcb56

Database:

Directory of Open Access Journals

Further Information

The dynamic and complex nature of distributed systems makes fault localization extremely difficult, frequently leading to extended outages and higher operating expenses. This study proposes a deep learning-based fault localization framework that combines Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Convolutional Neural Network (CNN), LSTM+CNN, and Autoencoder+LSTM models. The log data undergo extensive preprocessing, including log parsing, feature extraction using TF-IDF and Word2Vec, and min-max normalisation, before the models are trained and assessed on five benchmark datasets: HDFS, OpenStack, Spark, Hadoop, and BGL. To ensure robustness, the methodology incorporates a 5-fold cross-validation strategy, model-specific architecture tuning, and 1-D sequence modelling. According to the experimental results, the CNN achieves the best results on the HDFS dataset, with a Mean Squared Error (MSE) of 0.00002 and a Coefficient of Determination (R2 score) of 0.996. The CNN consistently outperforms the other models in accuracy and performance across all datasets. The key contributions of this study are: 1) a thorough deep learning-based fault localization framework for distributed systems; 2) a comparison of five cutting-edge architectures on five real-world datasets; and 3) statistically validated performance benchmarks backed by Wilcoxon signed-rank tests and t-tests. These contributions provide useful information for implementing accurate and scalable fault localization in real-world distributed computing environments.
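
The abstract names min-max normalisation, 5-fold cross-validation, MSE, and the R2 score without defining them. The following is a minimal Python sketch of their standard textbook definitions, not the authors' implementation. All function names are hypothetical, and the contiguous, unshuffled fold layout is an assumption, since the record does not say how the folds were drawn.

```python
from typing import Iterator, List, Sequence, Tuple


def min_max_normalise(values: Sequence[float]) -> List[float]:
    """Rescale values linearly to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined (division by zero) when all targets are identical.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n_samples: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds.

    Assumption: the source does not specify shuffling or stratification.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

In this sketch, each sample appears in exactly one test fold, so averaging a metric such as MSE over the five folds uses every sample for evaluation once.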
