Please use this identifier to cite or link to this item:
https://doi.org/10.1007/s11227-025-07473-7


Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Kulasekara, Maithri | - |
dc.contributor.author | Inglés-Romero, Juan F. | - |
dc.contributor.author | Imbernón, Baldomero | - |
dc.contributor.author | Abellán, José L. | - |
dc.date.accessioned | 2025-07-09T10:49:02Z | - |
dc.date.available | 2025-07-09T10:49:02Z | - |
dc.date.issued | 2025-07-07 | - |
dc.identifier.citation | The Journal of Supercomputing (2025) 81:1114 | es |
dc.identifier.issn | Print: 0920-8542 | - |
dc.identifier.issn | Electronic: 1573-0484 | - |
dc.identifier.uri | http://hdl.handle.net/10201/157140 | - |
dc.description | © 2025, The Author(s). This manuscript version is made available under the CC-BY 4.0 license http://creativecommons.org/licenses/by/4.0/. This document is the published version of a work that appeared in final form in The Journal of Supercomputing. To access the final edited and published work, see https://doi.org/10.1007/s11227-025-07473-7 | es |
dc.description.abstract | Transformer-based multimodal models often require expensive, full-model training on task-specific all-modality datasets to achieve high accuracy on targeted downstream tasks. To reduce this significant cost, we introduce SAFFE, a methodology for building accurate, task-specific multimodal models with minimal training, using only standard GPU hardware. SAFFE leverages off-the-shelf, pre-trained, frozen unimodal encoders for each input modality (e.g., text, image, or audio) and connects them through a lightweight, trainable component called the FusionAlign Module (FAM). FAM is a bottleneck mid-fusion neural network, trained on the target dataset to align the outputs of the independently pre-trained unimodal encoders. This approach eliminates the need for end-to-end training while maintaining strong accuracy for the downstream task. As a proof of concept, we validate SAFFE on image retrieval and language understanding tasks. SAFFE-derived models outperform state-of-the-art multimodal systems on datasets such as CIFAR-10, ImageNet-100, and COCO, achieving competitive results with significantly fewer trainable parameters and training time. (An illustrative sketch of this design follows the metadata table below.) | es |
dc.format | application/pdf | es |
dc.format.extent | 31 | es |
dc.language | eng | es |
dc.publisher | Springer | es |
dc.relation | This work has been funded by MICIU/AEI/10.13039/501100011033 and by “European Union NextGenerationEU/PRTR” under the grants CNS2023-144241 and RYC2021-031966-I. | es |
dc.rights | info:eu-repo/semantics/openAccess | es |
dc.rights | Atribución 4.0 Internacional | * |
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | * |
dc.subject | Multimodal fusion | es |
dc.subject | Frozen model | es |
dc.subject | Transformer encoder | es |
dc.subject | Decoder | es |
dc.subject | Mid-fusion | es |
dc.title | SAFFE: multimodal model composition with semantic-alignment fusion of frozen encoders | es |
dc.type | info:eu-repo/semantics/article | es |
dc.relation.publisherversion | https://link.springer.com/article/10.1007/s11227-025-07473-7 | es |
dc.identifier.doi | https://doi.org/10.1007/s11227-025-07473-7 | - |
dc.contributor.department | Departamento de Ingeniería y Tecnología de Computadores | es |
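The abstract above describes SAFFE's core mechanism: frozen, pre-trained unimodal encoders whose outputs are aligned by a small trainable bottleneck mid-fusion network, the FusionAlign Module (FAM). Below is a minimal sketch of that idea in PyTorch. The layer sizes, module names, the stand-in encoders, and the placeholder loss are all illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of SAFFE-style fusion of frozen encoders, assuming PyTorch.
# All dimensions and the FAM layout are hypothetical illustrations.
import torch
import torch.nn as nn


class FusionAlignModule(nn.Module):
    """Bottleneck mid-fusion network aligning frozen unimodal embeddings.

    Each modality is projected into a shared bottleneck space, then the
    projections are concatenated and fused (hypothetical layout).
    """

    def __init__(self, text_dim=768, image_dim=1024, bottleneck_dim=256):
        super().__init__()
        self.text_proj = nn.Sequential(nn.Linear(text_dim, bottleneck_dim), nn.GELU())
        self.image_proj = nn.Sequential(nn.Linear(image_dim, bottleneck_dim), nn.GELU())
        # Mid-fusion: combine the projected modalities in the bottleneck space.
        self.fusion = nn.Sequential(
            nn.Linear(2 * bottleneck_dim, bottleneck_dim),
            nn.GELU(),
            nn.Linear(bottleneck_dim, bottleneck_dim),
        )

    def forward(self, text_emb, image_emb):
        t = self.text_proj(text_emb)
        v = self.image_proj(image_emb)
        return self.fusion(torch.cat([t, v], dim=-1))


# Stand-ins for off-the-shelf pre-trained encoders; in practice these would
# be real text/image backbones. They stay frozen (requires_grad=False).
text_encoder = nn.Linear(512, 768)
image_encoder = nn.Linear(512, 1024)
for enc in (text_encoder, image_encoder):
    for p in enc.parameters():
        p.requires_grad = False

# Only the FAM is trained, which is what keeps the training cost small.
fam = FusionAlignModule()
optimizer = torch.optim.AdamW(fam.parameters(), lr=1e-4)

# One toy training step on random tensors standing in for the target dataset.
text_in, image_in = torch.randn(8, 512), torch.randn(8, 512)
fused = fam(text_encoder(text_in), image_encoder(image_in))
loss = fused.pow(2).mean()  # placeholder; a real task-specific head/loss goes here
loss.backward()
optimizer.step()
```

In this sketch only `FusionAlignModule`'s parameters receive gradient updates, mirroring the abstract's claim that end-to-end training is avoided because the heavyweight encoders are never updated.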
Appears in collections: | Artículos |
Files in this item:
File | Description | Size | Format |
---|---|---|---|
s11227-025-07473-7.pdf | | 2.92 MB | Adobe PDF |
This item is subject to a Creative Commons License.