No Rush in Executing Atomic Instructions

Asgharzadeh, Ashkan; Feliu, Josué; Acacio, Manuel E.; Kaxiras, Stefanos; Ros, Alberto

Please use this identifier to cite or link to this item: http://hdl.handle.net/10201/150907


Title:	No Rush in Executing Atomic Instructions
Publication date:	Mar-2025
Abstract:	Hardware atomic instructions are the building blocks of synchronization algorithms. Historically, to guarantee atomicity and consistency, they were implemented using memory fences, committing older memory instructions, and draining the store buffer before initiating the execution of atomics. Unfortunately, the use of such memory fences entails huge performance penalties as it implies execution serialization, thus impeding instruction- and memory-level parallelism. The situation, however, seems to have changed recently. Through experiments on x86 machines, we discovered that current x86 processors manage to comply with the x86-TSO requirements while avoiding the performance overhead introduced by fences (fence-free or unfenced implementation). This paves the way to new potential optimizations to atomic instruction execution. In particular, our simulation experiments modeling unfenced atomics reveal that executing atomic instructions as soon as their operands are ready does not always lead to optimal performance. In fact, this increases the time that other threads must wait to obtain the cacheline. In contended scenarios, delaying the execution of the atomic instruction to minimize the time the cacheline is locked provides superior performance. Based on this observation, we present Rush or Wait (RoW), a hardware mechanism to decide when to execute an atomic instruction. The mechanism is based on a contention predictor that estimates if an atomic will access a contended cacheline. Non-contended atomics execute once their operands are ready. Contended atomics, on the contrary, wait to become the oldest memory instruction and to drain the store buffer to execute, minimizing the contention on the accessed cacheline. Our experimental evaluation shows that RoW reduces execution time on average by 9.2% (and up to 43%) compared to a baseline that executes atomics as soon as the operands are ready, and yet it requires a small area overhead (64 bytes).
Main author(s):	Asgharzadeh, Ashkan; Feliu, Josué; Acacio, Manuel E.; Kaxiras, Stefanos; Ros, Alberto
Part of:	2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
URI:	http://hdl.handle.net/10201/150907
Document type:	info:eu-repo/semantics/preprint
Rights:	info:eu-repo/semantics/openAccess; Attribution-NonCommercial-NoDerivatives 4.0 International
Appears in collections:	Articles
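
The decision policy summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' design: the table size, the 2-bit saturating counters, the threshold, the PC-based indexing, and the names `ContentionPredictor` and `may_execute` are all assumptions made for illustration. How contention is actually detected and fed back (the `was_contended` argument) is not described in this record and is left to the caller.

```python
from dataclasses import dataclass, field

NUM_ENTRIES = 64   # hypothetical table size, not taken from the paper
COUNTER_MAX = 3    # 2-bit saturating counter (assumption)
THRESHOLD = 2      # predict "contended" at or above this value (assumption)


@dataclass
class ContentionPredictor:
    """Toy predictor: one saturating counter per (hashed) atomic PC."""
    counters: list = field(default_factory=lambda: [0] * NUM_ENTRIES)

    def _index(self, pc: int) -> int:
        # Simple hypothetical hash: drop low bits, wrap to table size.
        return (pc >> 2) % NUM_ENTRIES

    def predict_contended(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= THRESHOLD

    def update(self, pc: int, was_contended: bool) -> None:
        # Train the counter with the observed outcome of the atomic.
        i = self._index(pc)
        if was_contended:
            self.counters[i] = min(COUNTER_MAX, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def may_execute(predictor: ContentionPredictor, pc: int, operands_ready: bool,
                is_oldest_memory_op: bool, store_buffer_empty: bool) -> bool:
    """Rush-or-wait decision for a single atomic instruction."""
    if not operands_ready:
        return False
    if predictor.predict_contended(pc):
        # Wait: execute only as the oldest memory instruction with a
        # drained store buffer, to keep the cacheline locked briefly.
        return is_oldest_memory_op and store_buffer_empty
    # Rush: a non-contended atomic executes once operands are ready.
    return True
```

In this sketch, an atomic whose counter is below the threshold issues as soon as its operands are ready, while one predicted as contended is held back until the two conditions named in the abstract are met.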

Files in this item:

File	Description	Size	Format
preprint_RoW_HPCA_2025.pdf		723.84 kB	Adobe PDF


This item is licensed under a Creative Commons license.