Please use this identifier to cite or link to this item:
https://doi.org/10.1109/ICCD63220.2024.00092


Title: | Hardware cache locking for all memory updates |
Publication date: | 2024 |
Publisher: | IEEE Computer Society |
ISBN: | 979-8-3503-8040-8 |
Keywords: | Multi core architectures; Micro architecture; Non atomic read modify write; False sharing; Hardware cache locking |
Abstract: | Many applications need to perform operations that involve reading a value from memory, modifying it, and then writing it back. Multiple architectures provide hardware support for these operations via read-modify-write (RMW) instructions. The primary benefit is that the read can request a cacheline with write permissions, reducing coherence protocol overhead since the write will find the cacheline with appropriate permissions. RMWs can be either atomic or non-atomic. Atomic RMWs, used for synchronization, commonly require (i) locking the cacheline to guarantee atomicity by preventing invalidations and (ii) enforcing serialization of instructions in the program (e.g., via memory fences), which may cause performance degradation based on the implemented memory consistency model. Non-atomic RMWs, while not requiring such strict measures, should only be used in data-race free code sections. However, other cores may invalidate a cacheline during a non-atomic RMW (e.g., due to false sharing), flushing the pipeline and causing the loss of write permissions obtained by the read, which is detrimental to performance. In this work, we propose a microarchitectural mechanism that enables non-atomic RMWs to fetch the cacheline locking it, thus preventing other cores from “stealing” the cacheline while allowing them to run concurrently with other instructions in the same core. Our proposal enables concurrent hardware cache locking for multiple non-atomic RMWs while guaranteeing deadlock freedom and no programmer/compiler intervention. We also propose a lock-chaining mechanism to allow multiple consecutive memory updates to the same cacheline up to a predefined maximum (to prevent starvation and load imbalance). Our evaluation using the gem5 full-system simulator shows that for an eight-core configuration, our proposal improves performance by up to 5.36% (2.05% on average), requiring just 45 bytes of storage per core. |
Main author(s): | Asgharzadeh, Ashkan; Gómez Hernández, Eduardo José; Cebrián, Juan M.; Kaxiras, Stefanos; Ros Bardisa, Alberto |
Part of: | 42nd IEEE International Conference on Computer Design (ICCD 2024), November 18-20, Milan, pp. 566-574 |
URI: | http://hdl.handle.net/10201/147577 |
DOI: | https://doi.org/10.1109/ICCD63220.2024.00092 |
Document type: | info:eu-repo/semantics/article |
Number of pages / Extent: | 9 |
Rights: | info:eu-repo/semantics/embargoedAccess |
Description: | © 2024 IEEE. This document is the Submitted Published version of a Published Work that appeared in final form in 42nd IEEE International Conference on Computer Design (ICCD 2024). To access the final edited and published work see https://doi.org/10.1109/ICCD63220.2024.00092 |
Appears in collections: | Articles |
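
The false-sharing scenario mentioned in the abstract can be illustrated with a minimal C++ sketch. This is not code from the paper: the `SharedLine` struct, the `run_false_sharing` function, and the 64-byte cacheline size are all assumptions made for illustration. Two threads each perform non-atomic read-modify-write updates on their own counter, so the code is data-race free, yet both counters sit in the same cacheline and each core may keep invalidating the other's copy.

```cpp
#include <cstdint>
#include <thread>

// Assumption: 64-byte cachelines. Both counters share one line on purpose.
struct alignas(64) SharedLine {
    std::uint64_t a = 0;  // updated only by thread 0
    std::uint64_t b = 0;  // updated only by thread 1
};

struct Totals {
    std::uint64_t a;
    std::uint64_t b;
};

// Hypothetical helper: runs two threads that update disjoint variables
// located in the same cacheline and returns the final counter values.
Totals run_false_sharing(std::uint64_t iterations) {
    SharedLine line;

    // volatile keeps one load and one store per iteration, so each step
    // is a real non-atomic read-modify-write on memory.
    auto bump = [iterations](volatile std::uint64_t* counter) {
        for (std::uint64_t i = 0; i < iterations; ++i) {
            *counter = *counter + 1;
        }
    };

    std::thread t0(bump, &line.a);
    std::thread t1(bump, &line.b);
    t0.join();
    t1.join();

    return Totals{line.a, line.b};
}
```

Because no variable is shared, the result is always correct; only performance suffers from the cacheline bouncing between cores. Placing each counter in its own `alignas(64)` struct is the usual software workaround, whereas the mechanism described in the abstract addresses the problem in hardware without programmer or compiler intervention.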
Files in this item:
File | Description | Size | Format |
---|---|---|---|
aasgharzadeh-iccd24.pdf | | 339.17 kB | Adobe PDF |
Items in Digitum are protected by copyright, with all rights reserved, unless otherwise indicated.