Please use this identifier to cite or link to this item: https://doi.org/10.1007/s11227-025-07473-7

Full metadata record

DC Field: Value

dc.contributor.author: Kulasekara, Maithri
dc.contributor.author: Inglés-Romero, Juan F.
dc.contributor.author: Imbernón, Baldomero
dc.contributor.author: Abellán, José L.
dc.date.accessioned: 2025-07-09T10:49:02Z
dc.date.available: 2025-07-09T10:49:02Z
dc.date.issued: 2025-07-07
dc.identifier.citation: The Journal of Supercomputing (2025) 81:1114
dc.identifier.issn: Print: 0920-8542
dc.identifier.issn: Electronic: 1573-0484
dc.identifier.uri: http://hdl.handle.net/10201/157140
dc.description: © 2025, The Author(s). This manuscript version is made available under the CC-BY 4.0 license http://creativecommons.org/licenses/by/4.0/. This document is the published version of a work that appeared in final form in The Journal of Supercomputing. To access the final edited and published work, see https://doi.org/10.1007/s11227-025-07473-7
dc.description.abstract: Transformer-based multimodal models often require expensive, full-model training on task-specific all-modality datasets to achieve high accuracy on targeted downstream tasks. To reduce this significant cost, we introduce SAFFE, a methodology for building accurate, task-specific multimodal models with minimal training, using only standard GPU hardware. SAFFE leverages off-the-shelf, pre-trained, frozen unimodal encoders for each input modality (e.g., text, image, or audio) and connects them through a lightweight, trainable component called the FusionAlign Module (FAM). FAM is a bottleneck mid-fusion neural network, trained on the target dataset to align the outputs of the independently pre-trained unimodal encoders. This approach eliminates the need for end-to-end training while maintaining strong accuracy for the downstream task. As a proof of concept, we validate SAFFE on image retrieval and language understanding tasks. SAFFE-derived models outperform state-of-the-art multimodal systems on datasets such as CIFAR-10, ImageNet-100, and COCO, achieving competitive results with significantly fewer trainable parameters and training time.
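The abstract describes the FusionAlign Module as a bottleneck mid-fusion network placed between frozen unimodal encoders. Below is a minimal NumPy sketch of that general idea only: the function name, dimensions, concatenation-based fusion, and ReLU activation are illustrative assumptions, not the paper's actual architecture.

```python
# Hedged sketch of a bottleneck mid-fusion layer (FAM-like idea).
# All names and dimensions here are assumptions for illustration;
# the real FAM design is specified in the paper itself.
import numpy as np

rng = np.random.default_rng(0)

def fusion_align(text_emb, image_emb, W_down, W_up):
    """Concatenate frozen unimodal embeddings, then pass them through
    a trainable down-projection / up-projection bottleneck."""
    fused = np.concatenate([text_emb, image_emb], axis=-1)  # (d_text + d_image,)
    hidden = np.maximum(0.0, fused @ W_down)                # ReLU bottleneck
    return hidden @ W_up                                    # aligned joint embedding

# Illustrative sizes: only W_down and W_up would be trainable;
# the encoders producing t and v stay frozen.
d_text, d_image, d_bottleneck, d_out = 512, 768, 128, 512
W_down = rng.normal(size=(d_text + d_image, d_bottleneck)) * 0.02
W_up = rng.normal(size=(d_bottleneck, d_out)) * 0.02

t = rng.normal(size=(d_text,))   # stand-in for a frozen text-encoder output
v = rng.normal(size=(d_image,))  # stand-in for a frozen image-encoder output
out = fusion_align(t, v, W_down, W_up)
print(out.shape)
```

Training only the small projection matrices, rather than the full encoders, is what keeps the trainable-parameter count low in such a setup.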
dc.format: application/pdf
dc.format.extent: 31
dc.language: eng
dc.publisher: Springer
dc.relation: This work has been funded by MICIU/AEI/10.13039/501100011033 and by "European Union NextGenerationEU/PRTR" under the grants CNS2023-144241 and RYC2021-031966-I.
dc.rights: info:eu-repo/semantics/openAccess
dc.rights: Attribution 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Multimodal fusion
dc.subject: Frozen model
dc.subject: Transformer encoder
dc.subject: Decoder
dc.subject: Mid-fusion
dc.title: SAFFE: multimodal model composition with semantic-alignment fusion of frozen encoders
dc.type: info:eu-repo/semantics/article
dc.relation.publisherversion: https://link.springer.com/article/10.1007/s11227-025-07473-7
dc.identifier.doi: https://doi.org/10.1007/s11227-025-07473-7
dc.contributor.department: Departamento de Ingeniería y Tecnología de Computadores
Appears in collections: Artículos

Files in this item:
File: s11227-025-07473-7.pdf — 2.92 MB — Adobe PDF


This item is licensed under a Creative Commons License.