Scalability Limitations of Processing-in-Memory using Real System Evaluations

Gilbert, Jonatan; Haeyoon, Cho; Hyojun, Son; Xiangyu, Wu; Neal, Livesay; Evelio, Mora; Kaustubh, Shivdikar; José L., Abellán; Ajay, Joshi; David, Kaeli; John, Kim

Please use this identifier to cite or link to this item: https://dl.acm.org/doi/10.1145/3639046


Title:	Scalability Limitations of Processing-in-Memory using Real System Evaluations
Publication date:	Mar-2024
Publisher:	Association for Computing Machinery (ACM)
Citation:	Proceedings of the ACM on Measurement and Analysis of Computing Systems, Vol. 8, No. 1, Article 5.
ISSN:	Electronic: 2476-1249
Keywords:	Processing-in-memory; Interconnection networks; Collective communication
Abstract:	Processing-in-memory (PIM), where the compute is moved closer to the memory or the data, has been widely explored to accelerate emerging workloads. Recently, different PIM-based systems have been announced by memory vendors to minimize data movement and improve performance as well as energy efficiency. One critical component of PIM is the large amount of compute parallelism provided across many PIM “nodes” or the compute units near the memory. In this work, we provide an extensive evaluation and analysis of real PIM systems based on UPMEM PIM. We show that while there are benefits of PIM, there are also scalability challenges and limitations as the number of PIM nodes increases. In particular, we show how collective communications that are commonly found in many kernels/workloads can be problematic for PIM systems. To evaluate the impact of collective communication in PIM architectures, we provide an in-depth analysis of two workloads on the UPMEM PIM system that utilize representative common collective communication patterns – AllReduce and All-to-All communication. Specifically, we evaluate 1) embedding tables that are commonly used in recommendation systems that require AllReduce and 2) the Number Theoretic Transform (NTT) kernel, which is a critical component of Fully Homomorphic Encryption (FHE) that requires All-to-All communication. We analyze the performance benefits of these workloads and show how they can be efficiently mapped to the PIM architecture through alternative data partitioning. However, since each PIM compute unit can only access its local memory, when communication is necessary between PIM nodes (or remote data is needed), communication between the compute units must be done through the host CPU, thereby severely hampering application performance. To increase the scalability (or applicability) of PIM to future workloads, we make the case for how future PIM architectures need efficient communication or interconnection networks between the PIM nodes that require both hardware and software support.
Main author(s):	Gilbert, Jonatan; Haeyoon, Cho; Hyojun, Son; Xiangyu, Wu; Neal, Livesay; Evelio, Mora; Kaustubh, Shivdikar; José L., Abellán; Ajay, Joshi; David, Kaeli; John, Kim
URI:	http://hdl.handle.net/10201/139942
DOI:	https://dl.acm.org/doi/10.1145/3639046
Document type:	info:eu-repo/semantics/article
Number of pages / Extent:	28
Rights:	info:eu-repo/semantics/openAccess
Description:	©2024. This manuscript version is made available under the CC-BY license (https://creativecommons.org/licenses/by/4.0/). This document is the Accepted version of a Published Work that appeared in final form in Proceedings of the ACM on Measurement and Analysis of Computing Systems. To access the final edited and published work, see https://dl.acm.org/doi/10.1145/3639046
Appears in collections:	Articles

Files in this item:

File	Description	Size	Format
UPMEM_Scalability (1).pdf		2.86 MB	Adobe PDF


This item is subject to a Creative Commons license.