Please use this identifier to cite or link to this item: https://doi.org/10.1007/s11227-025-07473-7

Title: Saffe: multimodal model composition with semantic‑alignment fusion of frozen encoders
Publication date: 7-Jul-2025
Publisher: Springer
Citation: The Journal of Supercomputing (2025) 81:1114
ISSN: Print: 0920-8542
Electronic: 1573-0484
Keywords: Multimodal fusion
Frozen model
Transformer encoder
Decoder
Mid-fusion
Abstract: Transformer-based multimodal models often require expensive, full-model training on task-specific all-modality datasets to achieve high accuracy on targeted downstream tasks. To reduce this significant cost, we introduce SAFFE, a methodology for building accurate, task-specific multimodal models with minimal training, using only standard GPU hardware. SAFFE leverages off-the-shelf, pre-trained, frozen unimodal encoders for each input modality (e.g., text, image, or audio) and connects them through a lightweight, trainable component called the FusionAlign Module (FAM). FAM is a bottleneck mid-fusion neural network, trained on the target dataset to align the outputs of the independently pre-trained unimodal encoders. This approach eliminates the need for end-to-end training while maintaining strong accuracy for the downstream task. As a proof of concept, we validate SAFFE on image retrieval and language understanding tasks. SAFFE-derived models outperform state-of-the-art multimodal systems on datasets such as CIFAR-10, ImageNet-100, and COCO, achieving competitive results with significantly fewer trainable parameters and training time.
Main author(s): Kulasekara, Maithri
Inglés‑Romero, Juan F.
Imbernón, Baldomero
Abellán, José L.
Publisher's version: https://link.springer.com/article/10.1007/s11227-025-07473-7
URI: http://hdl.handle.net/10201/157140
DOI: https://doi.org/10.1007/s11227-025-07473-7
Document type: info:eu-repo/semantics/article
Number of pages / Extent: 31
Rights: info:eu-repo/semantics/openAccess
Attribution 4.0 International
Description: © 2025, The Author(s). This manuscript version is made available under the CC-BY 4.0 license http://creativecommons.org/licenses/by/4.0/. This document is the Published version of a Published Work that appeared in final form in Journal of Supercomputing. To access the final edited and published work see https://doi.org/10.1007/s11227-025-07473-7
Appears in collections: Articles

Files in this item:
File: s11227-025-07473-7.pdf
Size: 2.92 MB
Format: Adobe PDF


This item is licensed under a Creative Commons license.