Title:
Fusion of deep transfer learning models with Gannet optimisation algorithm for an advanced image captioning system for visual disabilities.
Authors:
Alkhaldi TM; Department of Educational Technologies, Imam Abdulrahman bin Faisal University, Dammam, Saudi Arabia., Asiri MM; Department of Computer Science, Applied College at Mahayil, King Khalid University, Abha, Saudi Arabia. abusharara@kku.edu.sa., Alzahrani F; Department of Information and Computer Science, College of Computing and Mathematics, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia., Sharif MM; Department of Computer and Self Development, Preparatory Year Deanship, Prince Sattam bin Abdulaziz University, AlKharj, Saudi Arabia.; King Salman Centre for Disability Research, Riyadh, 11614, Saudi Arabia.
Source:
Scientific reports [Sci Rep] 2025 Nov 18; Vol. 15 (1), pp. 40446. Date of Electronic Publication: 2025 Nov 18.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: Nature Publishing Group; Country of Publication: England; NLM ID: 101563288; Publication Model: Electronic; Cited Medium: Internet; ISSN: 2045-2322 (Electronic); Linking ISSN: 20452322; NLM ISO Abbreviation: Sci Rep; Subsets: MEDLINE
Imprint Name(s):
Original Publication: London : Nature Publishing Group, copyright 2011-
Contributed Indexing:
Keywords: Contrast enhancement; Gannet optimisation algorithm; Image captioning; Transfer learning; Visual disabilities
Entry Date(s):
Date Created: 20251118 Date Completed: 20251118 Latest Revision: 20251121
Update Code:
20251121
PubMed Central ID:
PMC12627861
DOI:
10.1038/s41598-025-24171-9
PMID:
41253878
Database:
MEDLINE

Further Information

The task of generating natural language descriptions of images to convey their visual content has garnered significant attention in computer vision (CV) and natural language processing (NLP). It is driven by applications such as virtual assistants, image indexing and retrieval, image perception, and assistance for visually challenged people. Although visually impaired individuals rely on other senses, such as hearing and touch, to identify events and objects, their quality of life remains reduced relative to that of sighted individuals. Automated image captioning generates captions that can be spoken aloud to individuals with visual disabilities, helping them recognize objects and events happening near them. With the aid of image captioning techniques and AI-driven speech synthesis, visually impaired individuals can quickly understand the content of an image, as these methods automatically generate text captions that accurately describe it. Therefore, this study presents a novel Fusion of Deep Transfer Learning Models and the Gannet Optimisation Algorithm for an Advanced Image Captioning System for Visual Disabilities (FDTLGO-AICSVD) model. The aim is to present a robust and efficient image captioning framework specifically designed to assist visually impaired persons through precise and descriptive image-to-text conversion. Initially, the FDTLGO-AICSVD approach comprises two image preprocessing stages, noise removal and contrast enhancement, aimed at improving the clarity of visual features. Text preprocessing involves distinct steps to standardize and prepare the textual data for analysis. Furthermore, the DenseNet121, VGG19, and MobileNetV2 models are utilized for extracting features from image data, whereas Term Frequency-Inverse Document Frequency (TF-IDF) is applied for extracting features from text data. To achieve optimal performance, the Gannet optimisation algorithm (GOA) is employed for hyperparameter tuning, enabling the method to generate precise and context-aware captions. Extensive experimentation with the FDTLGO-AICSVD method is performed on the Flickr8k and Flickr30k datasets. The comparison study of the FDTLGO-AICSVD method showed a superior BLEU-4 score of 45.11% on the Flickr8k dataset and 58.91% on the Flickr30k dataset, along with significantly higher CIDEr scores of 63.17 on Flickr8k and 69.81 on Flickr30k, demonstrating the enhanced descriptive accuracy and language-generation capability of the model across both datasets.
(© 2025. The Author(s).)
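
The abstract names noise removal and contrast enhancement as the two image-preprocessing stages but does not specify the filters used. The sketch below uses non-local-means denoising and CLAHE as plausible stand-ins; these are assumptions, not the paper's confirmed choices:

```python
# Illustrative preprocessing sketch: non-local-means denoising and CLAHE
# stand in for the unspecified "noise removal" and "contrast enhancement"
# stages described in the abstract.
import cv2

def preprocess_image(path: str):
    img = cv2.imread(path)  # BGR, uint8
    # Noise removal (assumed filter): non-local-means denoising.
    img = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 21)
    # Contrast enhancement (assumed filter): CLAHE on the luminance channel.
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
```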
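
The three named backbones are standard ImageNet-pretrained models, so a minimal version of the image-feature extraction can be sketched with Keras. Concatenating the pooled embeddings is an assumption here, since the abstract does not state how the backbone features are fused:

```python
# Minimal sketch (not the authors' code) of extracting and fusing features
# from the three pretrained backbones named in the abstract.
import numpy as np
from tensorflow.keras.applications import DenseNet121, VGG19, MobileNetV2
from tensorflow.keras.applications import densenet, vgg19, mobilenet_v2

# Global-average-pooled backbones give one fixed-length vector per image.
backbones = [
    (DenseNet121(weights="imagenet", include_top=False, pooling="avg"),
     densenet.preprocess_input),
    (VGG19(weights="imagenet", include_top=False, pooling="avg"),
     vgg19.preprocess_input),
    (MobileNetV2(weights="imagenet", include_top=False, pooling="avg"),
     mobilenet_v2.preprocess_input),
]

def fused_image_features(images: np.ndarray) -> np.ndarray:
    """images: float array of shape (n, 224, 224, 3) in [0, 255].
    Returns the concatenated backbone embeddings
    (1024 + 512 + 1280 = 2816 dims per image)."""
    feats = []
    for model, preprocess in backbones:
        x = preprocess(images.copy())  # each backbone expects its own scaling
        feats.append(model.predict(x, verbose=0))
    return np.concatenate(feats, axis=1)  # late fusion by concatenation (assumed)
```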
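
The TF-IDF text-feature step maps directly onto scikit-learn's vectorizer; the toy caption corpus and the preprocessing options below are illustrative only:

```python
# Sketch of the TF-IDF text-feature step using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

captions = [  # illustrative corpus, not the paper's data
    "a dog runs across the grass",
    "two children play on a beach",
    "a man rides a bicycle down the street",
]

# Lowercasing and stop-word removal stand in for the "text preprocessing"
# steps the abstract mentions without specifying.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf_matrix = vectorizer.fit_transform(captions)  # (n_captions, vocab) sparse
print(tfidf_matrix.shape, vectorizer.get_feature_names_out()[:5])
```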
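
GOA's gannet-dive update equations are not given in the abstract, so the following shows only where a population-based optimiser plugs into hyperparameter tuning, with a generic perturb-around-best loop standing in for the real algorithm. The search space and scoring function are hypothetical:

```python
# Generic population-based hyperparameter search standing in for GOA;
# the real algorithm would replace the perturbation step with its
# gannet-dive update rules. All names and ranges are hypothetical.
import random

search_space = {"learning_rate": (1e-4, 1e-2), "dropout": (0.1, 0.5)}

def score_model(params):
    # Toy stand-in objective; in the paper this would be a validation
    # metric such as BLEU-4 or CIDEr from a short training run.
    return -(params["learning_rate"] - 1e-3) ** 2 - (params["dropout"] - 0.3) ** 2

def tune(pop_size=10, generations=5):
    # Random initial population of candidate hyperparameter sets.
    pop = [{k: random.uniform(*rng) for k, rng in search_space.items()}
           for _ in range(pop_size)]
    best = max(pop, key=score_model)
    for _ in range(generations):
        # Perturb candidates around the current best, clipped to the ranges;
        # GOA would apply its dive/capture equations at this step instead.
        pop = [{k: min(max(best[k] + random.gauss(0, 0.1 * (rng[1] - rng[0])),
                           rng[0]), rng[1])
                for k, rng in search_space.items()}
               for _ in range(pop_size)]
        best = max(pop + [best], key=score_model)
    return best

print(tune())
```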

Declarations. Competing interests: The authors declare no competing interests.