Please use this identifier to cite or link to this item:
https://doi.org/10.1007/s11227-025-07473-7


Title: | SAFFE: multimodal model composition with semantic-alignment fusion of frozen encoders |
Publication date: | 7 Jul 2025 |
Publisher: | Springer |
Citation: | The Journal of Supercomputing (2025) 81:1114 |
ISSN: | Print: 0920-8542; Electronic: 1573-0484 |
Keywords: | Multimodal fusion; Frozen model; Transformer encoder; Decoder; Mid-fusion |
Abstract: | Transformer-based multimodal models often require expensive, full-model training on task-specific all-modality datasets to achieve high accuracy on targeted downstream tasks. To reduce this significant cost, we introduce SAFFE, a methodology for building accurate, task-specific multimodal models with minimal training, using only standard GPU hardware. SAFFE leverages off-the-shelf, pre-trained, frozen unimodal encoders for each input modality (e.g., text, image, or audio) and connects them through a lightweight, trainable component called the FusionAlign Module (FAM). FAM is a bottleneck mid-fusion neural network, trained on the target dataset to align the outputs of the independently pre-trained unimodal encoders. This approach eliminates the need for end-to-end training while maintaining strong accuracy for the downstream task. As a proof of concept, we validate SAFFE on image retrieval and language understanding tasks. SAFFE-derived models outperform state-of-the-art multimodal systems on datasets such as CIFAR-10, ImageNet-100, and COCO, achieving competitive results with significantly fewer trainable parameters and training time. |
Main author(s): | Kulasekara, Maithri; Inglés-Romero, Juan F.; Imbernón, Baldomero; Abellán, José L. |
Publisher's version: | https://link.springer.com/article/10.1007/s11227-025-07473-7 |
URI: | http://hdl.handle.net/10201/157140 |
DOI: | https://doi.org/10.1007/s11227-025-07473-7 |
Document type: | info:eu-repo/semantics/article |
Number of pages / Extent: | 31 |
Rights: | info:eu-repo/semantics/openAccess; Attribution 4.0 International |
Description: | © 2025, The Author(s). This manuscript version is made available under the CC-BY 4.0 license http://creativecommons.org/licenses/by/4.0/. This document is the Published version of a Published Work that appeared in final form in Journal of Supercomputing. To access the final edited and published work see https://doi.org/10.1007/s11227-025-07473-7 |
Appears in collections: | Articles |
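
The abstract describes frozen unimodal encoders joined by a small trainable bottleneck. The toy sketch below illustrates only that general idea and is not the authors' FusionAlign Module. Everything in it is an assumption made for illustration: the fixed random linear maps standing in for pre-trained encoders, the hypothetical names `make_frozen_encoder` and `ToyFusion`, the dimensions, and the arbitrary regression target.

```python
import math
import random

random.seed(0)

def make_frozen_encoder(in_dim, out_dim):
    """Fixed random linear map standing in for a pre-trained unimodal encoder."""
    # Stored as tuples: these weights are never updated (the encoder is "frozen").
    weights = tuple(
        tuple(random.uniform(-1, 1) for _ in range(in_dim)) for _ in range(out_dim)
    )
    def encode(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return encode, weights

class ToyFusion:
    """Trainable part: concatenated embeddings -> tanh bottleneck -> scalar."""
    def __init__(self, in_dim, bottleneck):
        self.W = [[random.uniform(-0.5, 0.5) for _ in range(in_dim)]
                  for _ in range(bottleneck)]
        self.v = [random.uniform(-0.5, 0.5) for _ in range(bottleneck)]

    def forward(self, z):
        h = [math.tanh(sum(w * zj for w, zj in zip(row, z))) for row in self.W]
        return sum(vi * hi for vi, hi in zip(self.v, h)), h

    def num_params(self):
        return sum(len(row) for row in self.W) + len(self.v)

def mean_loss(fusion, data):
    return sum((fusion.forward(z)[0] - t) ** 2 for z, t in data) / len(data)

def train(fusion, data, lr=0.02, epochs=500):
    """Full-batch gradient descent on squared error; only fusion weights change."""
    n = len(data)
    for _ in range(epochs):
        gW = [[0.0] * len(row) for row in fusion.W]
        gv = [0.0] * len(fusion.v)
        for z, t in data:
            y, h = fusion.forward(z)
            dy = 2.0 * (y - t) / n
            for i, hi in enumerate(h):
                gv[i] += dy * hi
                dpre = dy * fusion.v[i] * (1.0 - hi * hi)  # back through tanh
                for j, zj in enumerate(z):
                    gW[i][j] += dpre * zj
        for i in range(len(fusion.v)):
            fusion.v[i] -= lr * gv[i]
            for j in range(len(fusion.W[i])):
                fusion.W[i][j] -= lr * gW[i][j]

# Two toy modalities, each with its own frozen encoder (4-d input -> 3-d embedding).
image_encoder, image_weights = make_frozen_encoder(4, 3)
text_encoder, text_weights = make_frozen_encoder(4, 3)

data = []
for _ in range(32):
    img = [random.uniform(-1, 1) for _ in range(4)]
    txt = [random.uniform(-1, 1) for _ in range(4)]
    z = image_encoder(img) + text_encoder(txt)  # list concatenation of embeddings
    data.append((z, sum(img) + sum(txt)))       # arbitrary toy target

fusion = ToyFusion(in_dim=6, bottleneck=2)
loss_before = mean_loss(fusion, data)
train(fusion, data)
loss_after = mean_loss(fusion, data)
print(f"trainable parameters: {fusion.num_params()}")
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

In this sketch the encoders contribute no trainable parameters. Only the bottleneck weights are fitted to the target data, which is the cost-saving pattern the abstract refers to.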
Files in this item:
File | Description | Size | Format |
---|---|---|---|
s11227-025-07473-7.pdf | | 2.92 MB | Adobe PDF |
This item is licensed under a Creative Commons license.