Publication: No Rush in Executing Atomic Instructions
Authors
Asgharzadeh, Ashkan ; Feliu, Josué ; Acacio, Manuel E. ; Kaxiras, Stefanos ; Ros, Alberto
Type
info:eu-repo/semantics/preprint
Description
Abstract
Hardware atomic instructions are the building blocks of synchronization algorithms. Historically, to guarantee atomicity and consistency, they were implemented using memory fences, committing older memory instructions, and draining the store buffer before initiating the execution of atomics. Unfortunately, the use of such memory fences entails huge performance penalties as it implies execution serialization, thus impeding instruction- and memory-level parallelism.
The situation, however, seems to have changed recently. Through experiments on x86 machines, we discovered that current x86 processors manage to comply with the x86-TSO requirements while avoiding the performance overhead introduced by fences (fence-free or unfenced implementation). This paves the way for new potential optimizations of atomic instruction execution. In particular, our simulation experiments modeling unfenced atomics reveal that executing atomic instructions as soon as their operands are ready does not always lead to optimal performance. In fact, early execution increases the time that other threads must wait to obtain the cacheline. In contended scenarios, delaying the execution of the atomic instruction to minimize the time the cacheline is locked provides superior performance.
Based on this observation, we present Rush or Wait (RoW), a hardware mechanism to decide when to execute an atomic instruction. The mechanism is based on a contention predictor that estimates if an atomic will access a contended cacheline. Non-contended atomics execute once their operands are ready. Contended atomics, on the contrary, wait to become the oldest memory instruction and to drain the store buffer to execute, minimizing the contention on the accessed cacheline. Our experimental evaluation shows that RoW reduces execution time on average by 9.2% (and up to 43%) compared to a baseline that executes atomics as soon as the operands are ready, and yet it requires a small area overhead (64 bytes).
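The rush-or-wait decision described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the abstract does not describe how the predictor is organized, so the table of 2-bit saturating counters indexed by the instruction address, the threshold, and the names `ContentionPredictor` and `may_execute` are all hypothetical. The table is sized at 256 entries only so that its storage happens to total 64 bytes, the overhead quoted in the abstract.

```python
class ContentionPredictor:
    """Hypothetical predictor: a table of 2-bit saturating counters.

    Assumption (not from the paper): indexed by the atomic's PC modulo
    the table size. 256 entries x 2 bits = 64 bytes of storage.
    """

    def __init__(self, entries=256, threshold=2):
        self.entries = entries
        self.threshold = threshold
        self.counters = [0] * entries

    def _index(self, pc):
        return pc % self.entries

    def predict_contended(self, pc):
        # Predict contention once the counter reaches the threshold.
        return self.counters[self._index(pc)] >= self.threshold

    def update(self, pc, was_contended):
        # Train with the observed outcome, saturating at 0 and 3.
        i = self._index(pc)
        if was_contended:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def may_execute(predictor, pc, operands_ready,
                is_oldest_memory_op, store_buffer_empty):
    """Return True if the atomic at `pc` may start executing now."""
    if not operands_ready:
        return False
    if predictor.predict_contended(pc):
        # "Wait": become the oldest memory instruction and drain the
        # store buffer first, so the cacheline is locked only briefly.
        return is_oldest_memory_op and store_buffer_empty
    # "Rush": a non-contended atomic runs once its operands are ready.
    return True
```

In this model, an atomic that has repeatedly been observed to hit a contended cacheline is held back until the two conditions named in the abstract are met, while every other atomic issues as early as possible. How contention is actually detected and fed back to the predictor in hardware is not specified here.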
This item is subject to a Creative Commons license. http://creativecommons.org/licenses/by-nc-nd/4.0/