Please use this identifier to cite or link to this item: https://dl.acm.org/doi/10.1145/3639046

Title: Scalability Limitations of Processing-in-Memory using Real System Evaluations
Publication date: Mar-2024
Publisher: Association for Computing Machinery (ACM)
Citation: Proceedings of the ACM on Measurement and Analysis of Computing Systems, Vol. 8, No. 1, Article 5.
ISSN: Electronic: 2476-1249
Keywords: Processing-in-memory
Interconnection networks
Collective communication
Abstract: Processing-in-memory (PIM), where the compute is moved closer to the memory or the data, has been widely explored to accelerate emerging workloads. Recently, different PIM-based systems have been announced by memory vendors to minimize data movement and improve performance as well as energy efficiency. One critical component of PIM is the large amount of compute parallelism provided across many PIM “nodes” or the compute units near the memory. In this work, we provide an extensive evaluation and analysis of real PIM systems based on UPMEM PIM. We show that while there are benefits of PIM, there are also scalability challenges and limitations as the number of PIM nodes increases. In particular, we show how collective communications that are commonly found in many kernels/workloads can be problematic for PIM systems. To evaluate the impact of collective communication in PIM architectures, we provide an in-depth analysis of two workloads on the UPMEM PIM system that utilize representative common collective communication patterns: AllReduce and All-to-All communication. Specifically, we evaluate 1) embedding tables that are commonly used in recommendation systems that require AllReduce and 2) the Number Theoretic Transform (NTT) kernel which is a critical component of Fully Homomorphic Encryption (FHE) that requires All-to-All communication. We analyze the performance benefits of these workloads and show how they can be efficiently mapped to the PIM architecture through alternative data partitioning. However, since each PIM compute unit can only access its local memory, when communication is necessary between PIM nodes (or remote data is needed), communication between the compute units must be done through the host CPU, thereby severely hampering application performance. To increase the scalability (or applicability) of PIM to future workloads, we make the case for how future PIM architectures need efficient communication or interconnection networks between the PIM nodes that require both hardware and software support.
Main author(s): Gilbert, Jonatan
Haeyoon, Cho
Hyojun, Son
Xiangyu, Wu
Neal, Livesay
Evelio, Mora
Kaustubh, Shivdikar
José L., Abellán
Ajay, Joshi
David, Kaeli
John, Kim
Faculty/Departments/Services: Faculties, Departments, Services and Schools::UMU Departments::Ingeniería y Tecnología de Computadores
URI: http://hdl.handle.net/10201/139942
DOI: https://dl.acm.org/doi/10.1145/3639046
Document type: info:eu-repo/semantics/article
Number of pages / Length: 28
Rights: info:eu-repo/semantics/openAccess
Description: ©2024. This manuscript version is made available under the CC-BY license https://creativecommons.org/licenses/by/4.0/ This document is the Accepted version of a Published Work that appeared in final form in Proceedings of the ACM on Measurement and Analysis of Computing Systems. To access the final edited and published work, see https://dl.acm.org/doi/10.1145/3639046
Appears in collections: Artículos: Ingeniería y Tecnología de Computadores

Files in this item:
File: UPMEM_Scalability (1).pdf
Size: 2.86 MB
Format: Adobe PDF


This item is subject to a Creative Commons license.