Please use this identifier to cite or link to this item:
https://doi.org/10.1109/MICRO61859.2024.00065


Title: | Temporarily unauthorized stores: write first, ask for permission later |
Publication date: | 3-Dec-2024 |
Publisher: | IEEE Computer Society |
ISBN: | 979-8-3503-5057-9 |
Keywords: | Store buffer; Write operations; Cache coherence protocols; Multi core architectures |
Abstract: | x86 processors implement a total store order (x86-TSO) consistency model, which requires stores to update memory in a sequenced manner. The latency of stores is then hidden by the store buffer (SB), which holds stores until the write is performed. On a long latency cache miss, however, stores block the SB, eventually stalling the processor and degrading performance. Contemporary industrial high-performance processors deal with this situation by overprovisioning the size of the SB, but this comes at the cost of energy and latency overheads. In this work, we remove the stalls caused by stores blocked at the head of the SB while reusing existing processor resources, either improving performance when SB size is kept constant or maintaining performance while reducing SB size. Our proposal, Temporarily Unauthorized Stores (TUS), achieves this by extending the functionality of 1) the write combining buffers, to allow them to coalesce stores while maintaining x86-TSO consistency, and 2) immediately write data to the first-level cache upon a miss (i.e., providing an always-hit illusion) but temporarily keeping the written data invisible to the cache coherence protocol, i.e., these stores are temporarily unauthorized. TUS makes temporarily unauthorized stores visible in x86-TSO order without speculation or rollbacks once write permission is obtained. In essence, TUS logically transforms the write combining buffers and the first-level cache into an “extension” of the SB. TUS improves performance by up to 26% (3.2% on average) while reducing the total energy-delay-product (EDP) by up to 35.9% (6.4% on average) for SB-bound benchmarks with a 114-entry SB compared to our baseline architecture with an SB of the same size. When configured with a 32-entry SB, TUS yields a performance improvement of 2% over a 114-entry SB baseline while reducing SB energy per search by a factor of 2x, SB area by 21%, and store-to-load forwarding latency from 5 to 3 cycles. |
Main author(s): | Cebrian, Juan M.; Jahre, Magnus; Ros Bardisa, Alberto |
Part of: | 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024, November 2-6, Austin, pp. 810-822 |
Publisher's version: | https://www.computer.org/csdl/proceedings-article/micro/2024/505700a810/22nitA2IYeI |
URI: | http://hdl.handle.net/10201/147590 |
DOI: | https://doi.org/10.1109/MICRO61859.2024.00065 |
Document type: | info:eu-repo/semantics/article |
Number of pages / Extent: | 13 |
Rights: | info:eu-repo/semantics/embargoedAccess |
Description: | © 2024 IEEE. This document is the Submitted version of a Published Work that appeared in final form in 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). To access the final edited and published work see https://doi.org/10.1109/MICRO61859.2024.00065 |
Appears in collections: | Articles |
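
The abstract in this record describes how a store that misses in the cache blocks the head of the store buffer (SB) and eventually stalls the processor. The following is a minimal toy sketch of that head-of-line blocking only, not the authors' simulator and not a model of TUS itself. The function name `count_stall_cycles`, the one-store-per-cycle issue rate, and the rule that a store's latency starts counting only once it sits at the head of the SB are all illustrative assumptions.

```python
from collections import deque


def count_stall_cycles(latencies, sb_size):
    """Toy in-order store buffer (hypothetical model, for illustration only).

    latencies: cycles each store needs to perform, in program order (>= 1).
    sb_size:   number of store buffer entries.
    Returns the number of cycles in which the core wanted to insert a
    store but the buffer was full.
    """
    pending = deque(latencies)  # stores not yet inserted into the buffer
    sb = deque()                # remaining cycles per buffered store; sb[0] is the head
    stalls = 0
    while pending or sb:
        # Drain step: only the head store makes progress, so writes
        # leave the buffer strictly in program order.
        if sb:
            sb[0] -= 1
            if sb[0] <= 0:
                sb.popleft()
        # Issue step: the core tries to insert one store per cycle.
        if pending:
            if len(sb) < sb_size:
                sb.append(pending.popleft())
            else:
                stalls += 1
    return stalls


# One 100-cycle miss followed by eight 1-cycle hits.
trace = [100] + [1] * 8
print(count_stall_cycles(trace, sb_size=4))   # 96
print(count_stall_cycles(trace, sb_size=16))  # 0
```

In this toy trace, a 4-entry buffer fills up behind the miss and the core stalls until the miss completes, while a 16-entry buffer absorbs all nine stores. That mirrors the overprovisioning approach the abstract attributes to current processors. The abstract's energy, area, and latency costs of a larger SB are not modelled here.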
Files in this item:
File | Description | Size | Format | |
---|---|---|---|---|
jcebrian-micro24.pdf | | 2.26 MB | Adobe PDF | View/Open; Request a copy |
Items in Digitum are protected by copyright, with all rights reserved, unless otherwise indicated.