Saffe: multimodal model composition with semantic‑alignment fusion of frozen encoders

Kulasekara, Maithri; Inglés‑Romero, Juan F.; Imbernón, Baldomero; Abellán, José L.

Please use this identifier to cite or link to this item: https://doi.org/10.1007/s11227-025-07473-7


Title:	Saffe: multimodal model composition with semantic‑alignment fusion of frozen encoders
Publication date:	7 July 2025
Publisher:	Springer
Citation:	The Journal of Supercomputing (2025) 81:1114
ISSN:	Print: 0920-8542 Electronic: 1573-0484
Keywords:	Multimodal fusion; Frozen model; Transformer encoder; Decoder; Mid-fusion
Abstract:	Transformer-based multimodal models often require expensive, full-model training on task-specific all-modality datasets to achieve high accuracy on targeted downstream tasks. To reduce this significant cost, we introduce SAFFE, a methodology for building accurate, task-specific multimodal models with minimal training, using only standard GPU hardware. SAFFE leverages off-the-shelf, pre-trained, frozen unimodal encoders for each input modality (e.g., text, image, or audio) and connects them through a lightweight, trainable component called the FusionAlign Module (FAM). FAM is a bottleneck mid-fusion neural network, trained on the target dataset to align the outputs of the independently pre-trained unimodal encoders. This approach eliminates the need for end-to-end training while maintaining strong accuracy for the downstream task. As a proof of concept, we validate SAFFE on image retrieval and language understanding tasks. SAFFE-derived models outperform state-of-the-art multimodal systems on datasets such as CIFAR-10, ImageNet-100, and COCO, achieving competitive results with significantly fewer trainable parameters and training time.
Main author(s):	Kulasekara, Maithri; Inglés‑Romero, Juan F.; Imbernón, Baldomero; Abellán, José L.
Publisher's version:	https://link.springer.com/article/10.1007/s11227-025-07473-7
URI:	http://hdl.handle.net/10201/157140
DOI:	https://doi.org/10.1007/s11227-025-07473-7
Document type:	info:eu-repo/semantics/article
Number of pages / Extent:	31
Rights:	info:eu-repo/semantics/openAccess Attribution 4.0 International
Description:	© 2025, The Author(s). This manuscript version is made available under the CC-BY 4.0 license http://creativecommons.org/licenses/by/4.0/. This document is the Published version of a Published Work that appeared in final form in Journal of Supercomputing. To access the final edited and published work see https://doi.org/10.1007/s11227-025-07473-7
Appears in collections:	Articles
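
The abstract describes frozen unimodal encoders joined by a small trainable bottleneck fusion module. The record does not give the FusionAlign Module's actual layers, so the following is only a minimal sketch of that general composition pattern, written in plain Python with no deep learning library. The class names `FrozenEncoder` and `BottleneckFusion`, the random stand-in weights, and all dimensions are assumptions made for illustration, and no training loop is shown.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def random_matrix(rows, cols, rng, scale):
    """Build a rows x cols matrix of Gaussian weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class FrozenEncoder:
    """Stand-in for a pre-trained unimodal encoder (assumption: a fixed
    random projection). Its weights are never updated."""

    def __init__(self, in_dim, out_dim, seed):
        rng = random.Random(seed)
        self.weights = random_matrix(out_dim, in_dim, rng, 1.0 / math.sqrt(in_dim))

    def num_params(self):
        return len(self.weights) * len(self.weights[0])

    def encode(self, x):
        return [math.tanh(dot(row, x)) for row in self.weights]


class BottleneckFusion:
    """Hypothetical fusion layer, not the paper's FAM:
    concatenate embeddings -> project down -> ReLU -> project up."""

    def __init__(self, embed_dims, bottleneck_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.down = random_matrix(bottleneck_dim, sum(embed_dims), rng, 0.1)
        self.up = random_matrix(out_dim, bottleneck_dim, rng, 0.1)

    def num_trainable(self):
        # Only these two matrices would be updated during training.
        return sum(len(m) * len(m[0]) for m in (self.down, self.up))

    def forward(self, embeddings):
        joint = [v for emb in embeddings for v in emb]
        hidden = [max(0.0, dot(row, joint)) for row in self.down]
        return [dot(row, hidden) for row in self.up]


# Illustrative dimensions only; real encoders are far larger.
text_encoder = FrozenEncoder(in_dim=32, out_dim=16, seed=1)
image_encoder = FrozenEncoder(in_dim=64, out_dim=16, seed=2)
fusion = BottleneckFusion(embed_dims=[16, 16], bottleneck_dim=4, out_dim=8)

rng = random.Random(0)
text_input = [rng.random() for _ in range(32)]
image_input = [rng.random() for _ in range(64)]

fused = fusion.forward(
    [text_encoder.encode(text_input), image_encoder.encode(image_input)]
)
frozen_params = text_encoder.num_params() + image_encoder.num_params()

print("fused vector length:", len(fused))
print("frozen parameters:", frozen_params)
print("trainable parameters:", fusion.num_trainable())
```

Even in this toy setting the fusion layer holds far fewer weights than the two encoders combined, which is the property the abstract attributes to the approach.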

Files in this item:

File	Description	Size	Format
s11227-025-07473-7.pdf		2.92 MB	Adobe PDF


This item is licensed under a Creative Commons license.