Please use this identifier to cite or link to this item:
http://hdl.handle.net/10201/150907


Title: | No Rush in Executing Atomic Instructions |
Publication date: | Mar-2025 |
Abstract: | Hardware atomic instructions are the building blocks of synchronization algorithms. Historically, to guarantee atomicity and consistency, they were implemented using memory fences, committing older memory instructions, and draining the store buffer before initiating the execution of atomics. Unfortunately, the use of such memory fences entails huge performance penalties as it implies execution serialization, thus impeding instruction- and memory-level parallelism. The situation, however, seems to have changed recently. Through experiments on x86 machines, we discovered that current x86 processors manage to comply with the x86-TSO requirements while avoiding the performance overhead introduced by fences (fence-free or unfenced implementation). This paves the way to new potential optimizations to atomic instruction execution. In particular, our simulation experiments modeling unfenced atomics reveal that executing atomic instructions as soon as their operands are ready does not always lead to optimal performance. In fact, this increases the time that other threads must wait to obtain the cacheline. In contended scenarios, delaying the execution of the atomic instruction to minimize the time the cacheline is locked provides superior performance. Based on this observation, we present Rush or Wait (RoW), a hardware mechanism to decide when to execute an atomic instruction. The mechanism is based on a contention predictor that estimates if an atomic will access a contended cacheline. Non-contended atomics execute once their operands are ready. Contended atomics, on the contrary, wait to become the oldest memory instruction and to drain the store buffer to execute, minimizing the contention on the accessed cacheline. Our experimental evaluation shows that RoW reduces execution time on average by 9.2% (and up to 43%) compared to a baseline that executes atomics as soon as the operands are ready, and yet it requires a small area overhead (64 bytes). |
Main author(s): | Asgharzadeh, Ashkan; Feliu, Josué; Acacio, Manuel E.; Kaxiras, Stefanos; Ros, Alberto |
Part of: | 2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA) |
URI: | http://hdl.handle.net/10201/150907 |
Document type: | info:eu-repo/semantics/preprint |
Rights: | info:eu-repo/semantics/openAccess Attribution-NonCommercial-NoDerivatives 4.0 International |
Appears in collections: | Articles |
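
The Rush or Wait decision described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the design from the paper: the table of 2-bit saturating counters, the PC-based indexing, the threshold, and the names `ContentionPredictor` and `can_issue` are all hypothetical. The table size of 256 entries was chosen here only because 256 × 2 bits equals the 64 bytes quoted in the abstract.

```python
class ContentionPredictor:
    """Hypothetical predictor: one 2-bit saturating counter per table entry.

    Assumption: entries are indexed by the atomic instruction's PC.
    256 entries x 2 bits = 64 bytes, matching the budget in the abstract,
    but the real organization may differ.
    """

    def __init__(self, entries=256, threshold=2):
        self.entries = entries
        self.threshold = threshold
        self.counters = [0] * entries  # all atomics start as "non-contended"

    def _index(self, pc):
        # Drop the low bits and fold the PC into the table (assumed hash).
        return (pc >> 2) % self.entries

    def predict_contended(self, pc):
        return self.counters[self._index(pc)] >= self.threshold

    def update(self, pc, was_contended):
        # Train on the observed outcome; counters saturate at 0 and 3.
        i = self._index(pc)
        if was_contended:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def can_issue(predictor, pc, operands_ready, is_oldest_memory_op,
              store_buffer_empty):
    """Return True if the atomic at `pc` may start executing now."""
    if not operands_ready:
        return False
    if predictor.predict_contended(pc):
        # "Wait": hold the atomic until it is the oldest memory instruction
        # and the store buffer has drained, so the line is locked briefly.
        return is_oldest_memory_op and store_buffer_empty
    # "Rush": a non-contended atomic executes once its operands are ready.
    return True
```

In this model, an atomic whose counter has not reached the threshold issues as soon as its operands are ready, while one that has repeatedly been observed as contended is held back until the two conditions named in the abstract are met.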
Files in this item:
File | Description | Size | Format |
---|---|---|---|
preprint_RoW_HPCA_2025.pdf | | 723.84 kB | Adobe PDF |
This item is licensed under a Creative Commons license.