Please use this identifier to cite or link to this item: https://doi.org/10.1109/ICCD63220.2024.00092

Title: Hardware cache locking for all memory updates
Publication date: 2024
Publisher: IEEE Computer Society
ISBN: 979-8-3503-8040-8
Keywords: Multi core architectures
Micro architecture
Non atomic read modify write
False sharing
Hardware cache locking
Abstract: Many applications need to perform operations that involve reading a value from memory, modifying it, and then writing it back. Multiple architectures provide hardware support for these operations via read-modify-write (RMW) instructions. The primary benefit is that the read can request a cacheline with write permissions, reducing coherence protocol overhead since the write will find the cacheline with appropriate permissions. RMWs can be either atomic or non-atomic. Atomic RMWs, used for synchronization, commonly require (i) locking the cacheline to guarantee atomicity by preventing invalidations and (ii) enforcing serialization of instructions in the program (e.g., via memory fences), which may cause performance degradation depending on the implemented memory consistency model. Non-atomic RMWs, while not requiring such strict measures, should only be used in data-race-free code sections. However, other cores may invalidate a cacheline during a non-atomic RMW (e.g., due to false sharing), flushing the pipeline and causing the loss of the write permissions obtained by the read, which is detrimental to performance. In this work, we propose a microarchitectural mechanism that enables non-atomic RMWs to fetch the cacheline and lock it, thus preventing other cores from “stealing” the cacheline while allowing the RMWs to run concurrently with other instructions in the same core. Our proposal enables concurrent hardware cache locking for multiple non-atomic RMWs while guaranteeing deadlock freedom and requiring no programmer/compiler intervention. We also propose a lock-chaining mechanism to allow multiple consecutive memory updates to the same cacheline up to a predefined maximum (to prevent starvation and load imbalance). Our evaluation using the gem5 full-system simulator shows that, for an eight-core configuration, our proposal improves performance by up to 5.36% (2.05% on average), requiring just 45 bytes of storage per core.
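The software-visible situation the abstract describes can be sketched in C++. This is only an illustration of non-atomic RMWs under false sharing next to an atomic RMW, not the hardware mechanism proposed in the paper. The type and function names (`Shared`, `Padded`, `run_non_atomic`, `run_atomic`) are hypothetical, and the 64-byte cacheline size is an assumption that holds on many but not all machines.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

// Assumed cacheline size (hypothetical constant; 64 bytes is common, not universal).
constexpr std::size_t kLine = 64;

// Two counters placed in the same cacheline. Each thread updates only its own
// counter, so there is no data race, but the line can bounce between cores
// (false sharing).
struct alignas(kLine) Shared {
    volatile std::uint64_t a = 0;
    volatile std::uint64_t b = 0;
};

// The same counters, each aligned to its own cacheline (software workaround).
struct Padded {
    alignas(kLine) volatile std::uint64_t a = 0;
    alignas(kLine) volatile std::uint64_t b = 0;
};

// Non-atomic RMWs: every iteration reads, modifies, and writes back.
// `volatile` only stops the compiler from folding the loop into one store;
// it does not make the update atomic.
template <typename Counters>
std::uint64_t run_non_atomic(std::uint64_t iters) {
    assert(iters > 0);
    Counters c;
    auto bump = [iters](volatile std::uint64_t& x) {
        for (std::uint64_t i = 0; i < iters; ++i) x = x + 1;
    };
    std::thread t1([&] { bump(c.a); });
    std::thread t2([&] { bump(c.b); });
    t1.join();
    t2.join();
    return c.a + c.b;
}

// Atomic RMWs: both threads update one shared counter used for synchronization.
// Sequentially consistent fetch_add carries the ordering cost the abstract mentions.
std::uint64_t run_atomic(std::uint64_t iters) {
    assert(iters > 0);
    std::atomic<std::uint64_t> counter{0};
    auto bump = [&] {
        for (std::uint64_t i = 0; i < iters; ++i)
            counter.fetch_add(1, std::memory_order_seq_cst);
    };
    std::thread t1(bump);
    std::thread t2(bump);
    t1.join();
    t2.join();
    return counter.load();
}
```

Calling `run_non_atomic<Shared>(n)` and `run_non_atomic<Padded>(n)` returns the same total, since each counter has a single writer; any difference would show up only in timing, which depends on the machine. Padding avoids the false sharing in software, whereas the abstract's proposal targets the unpadded case in hardware without programmer or compiler intervention.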
Main author(s): Asgharzadeh, Ashkan
Gómez Hernández, Eduardo José
Cebrián, Juan M.
Kaxiras, Stefanos
Ros Bardisa, Alberto
Is part of: 42nd IEEE International Conference on Computer Design (ICCD 2024), November 18-20, Milan, pp. 566-574
URI: http://hdl.handle.net/10201/147577
DOI: https://doi.org/10.1109/ICCD63220.2024.00092
Document type: info:eu-repo/semantics/article
Number of pages / Extent: 9
Rights: info:eu-repo/semantics/embargoedAccess
Description: © 2024 IEEE. This document is the Submitted Published version of a Published Work that appeared in final form in 42nd IEEE International Conference on Computer Design (ICCD 2024). To access the final edited and published work see https://doi.org/10.1109/ICCD63220.2024.00092
Appears in collections: Articles

Files in this item:
File: aasgharzadeh-iccd24.pdf (339.17 kB, Adobe PDF)


Items in Digitum are protected by copyright, with all rights reserved, unless otherwise indicated.