Title:
Leveraging a Vision-Language Model with Natural Text Supervision for MRI Retrieval, Captioning, Classification, and Visual Question Answering.
Source:
Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference [Annu Int Conf IEEE Eng Med Biol Soc] 2025 Jul; Vol. 2025, pp. 1-7.
Publication Type:
Journal Article
Language:
English
Journal Info:
Publisher: [IEEE] Country of Publication: United States NLM ID: 101763872 Publication Model: Print Cited Medium: Internet ISSN: 2694-0604 (Electronic) Linking ISSN: 23757477 NLM ISO Abbreviation: Annu Int Conf IEEE Eng Med Biol Soc Subsets: MEDLINE
Imprint Name(s):
Original Publication: [Piscataway, NJ] : [IEEE], [2007]-
Comments:
Update of: bioRxiv. 2025 Feb 20:2025.02.15.638446. doi: 10.1101/2025.02.15.638446. (PMID: 40027630)
Entry Date(s):
Date Created: 20251203 Date Completed: 20251203 Latest Revision: 20251209
Update Code:
20251209
DOI:
10.1109/EMBC58623.2025.11251809
PMID:
41336503
Database:
MEDLINE

Further Information

Large multimodal models are now extensively used worldwide, with the most powerful ones trained on massive, general-purpose datasets. Despite their rapid deployment, concerns persist regarding the quality and domain relevance of the training data, especially in radiology, medical research, and neuroscience. Additionally, healthcare data privacy is paramount when querying models trained on medical data, as is transparency regarding service hosting and data storage. So far, most deep learning algorithms in radiologic research are designed to perform a specific task (e.g., diagnostic classification) and cannot be prompted to perform multiple tasks using natural language. In this work, we introduce a framework based on vector retrieval and contrastive learning to efficiently learn visual brain MRI concepts via natural language supervision. We show how the method learns to identify factors that affect the brain in Alzheimer's disease (AD) via joint embedding and natural language supervision. First, we pretrain separate text and image encoders using self-supervised learning, and jointly fine-tune these encoders to develop a shared embedding space. We train our model to perform multiple tasks, including MRI retrieval, MRI captioning, and MRI classification. We show its versatility by developing a retrieval and re-ranking mechanism along with a transformer decoder for visual question answering. Clinical Relevance - By learning a cross-modal embedding of radiologic features and text, our approach can learn to perform diagnostic and prognostic assessments in AD research as well as to assist practicing clinicians. Integrating medical imaging with clinical descriptions and text prompts, we aim to provide a general, versatile tool for detecting radiologic features described by text, offering a new approach to radiologic research.
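The abstract's core mechanism, jointly fine-tuning text and image encoders into a shared embedding space and retrieving MRIs by cosine similarity to a text query, can be illustrated with a minimal sketch. This is not the authors' implementation: the symmetric contrastive (InfoNCE-style) loss and the `retrieve` helper below are generic CLIP-style stand-ins, and all shapes, names, and the temperature value are assumptions for illustration.

```python
import numpy as np

def normalize(x):
    """L2-normalize embeddings along the feature axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text
    embeddings (hypothetical shape [B, D]); matching pairs sit on the
    diagonal of the similarity matrix."""
    img, txt = normalize(img_emb), normalize(txt_emb)
    logits = img @ txt.T / temperature        # [B, B] cosine similarities
    labels = np.arange(len(logits))           # i-th image matches i-th text

    def xent(l):
        # numerically stable softmax cross-entropy against the diagonal
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # average of image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2

def retrieve(query_txt_emb, gallery_img_emb, k=3):
    """Rank gallery MRI embeddings by cosine similarity to one text query,
    returning the indices of the top-k candidates."""
    sims = normalize(gallery_img_emb) @ normalize(query_txt_emb)
    return np.argsort(-sims)[:k]
```

In a pipeline like the one described, `retrieve` would supply candidates for a downstream re-ranking stage, and a transformer decoder conditioned on the retrieved embeddings would generate captions or VQA answers; those stages are omitted here.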