Temporarily unauthorized stores: write first, ask for permission later

Cebrian, Juan M.; Jahre, Magnus; Ros Bardisa, Alberto

Please use this identifier to cite or link to this item: https://doi.org/10.1109/MICRO61859.2024.00065


Title:	Temporarily unauthorized stores: write first, ask for permission later
Publication date:	3-Dec-2024
Publisher:	IEEE Computer Society
ISBN:	979-8-3503-5057-9
Keywords:	Store buffer; Write operations; Cache coherence protocols; Multi-core architectures
Abstract:	x86 processors implement a total store order (x86-TSO) consistency model, which requires stores to update memory in a sequenced manner. The latency of stores is then hidden by the store buffer (SB), which holds stores until the write is performed. On a long-latency cache miss, however, stores block the SB, eventually stalling the processor and degrading performance. Contemporary industrial high-performance processors deal with this situation by overprovisioning the size of the SB, but this comes at the cost of energy and latency overheads. In this work, we remove the stalls caused by stores blocked at the head of the SB while reusing existing processor resources, either improving performance when SB size is kept constant or maintaining performance while reducing SB size. Our proposal, Temporarily Unauthorized Stores (TUS), achieves this by 1) extending the functionality of the write combining buffers to allow them to coalesce stores while maintaining x86-TSO consistency, and 2) immediately writing data to the first-level cache upon a miss (i.e., providing an always-hit illusion) while temporarily keeping the written data invisible to the cache coherence protocol, i.e., these stores are temporarily unauthorized. TUS makes temporarily unauthorized stores visible in x86-TSO order without speculation or rollbacks once write permission is obtained. In essence, TUS logically transforms the write combining buffers and the first-level cache into an “extension” of the SB. TUS improves performance by up to 26% (3.2% on average) while reducing the total energy-delay product (EDP) by up to 35.9% (6.4% on average) for SB-bound benchmarks with a 114-entry SB compared to our baseline architecture with an SB of the same size. When configured with a 32-entry SB, TUS yields a performance improvement of 2% over a 114-entry SB baseline while reducing SB energy per search by a factor of 2, SB area by 21%, and store-to-load forwarding latency from 5 to 3 cycles.
Main author(s):	Cebrian, Juan M.; Jahre, Magnus; Ros Bardisa, Alberto
Part of:	57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024, November 2-6, Austin, pp. 810-822
Publisher's version:	https://www.computer.org/csdl/proceedings-article/micro/2024/505700a810/22nitA2IYeI
URI:	http://hdl.handle.net/10201/147590
DOI:	https://doi.org/10.1109/MICRO61859.2024.00065
Document type:	info:eu-repo/semantics/article
Number of pages / Extent:	13
Rights:	info:eu-repo/semantics/embargoedAccess
Description:	© 2024 IEEE. This document is the Submitted version of a Published Work that appeared in final form in 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). To access the final edited and published work see https://doi.org/10.1109/MICRO61859.2024.00065
Appears in collections:	Articles
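
The mechanism summarized in the abstract can be illustrated with a toy cycle-level model. This is only a sketch under simplifying assumptions and is not the authors' simulator or evaluation setup: the function name `simulate`, the one-store-per-cycle issue rate, the fixed miss latency, and the unlimited capacity for unauthorized lines are all hypothetical. In the baseline, a miss at the head of the SB blocks it until write permission arrives. In the TUS-like mode, the head store leaves the SB immediately and waits, unauthorized, until its permission arrives and every older store is already visible.

```python
from collections import deque


def simulate(stores, sb_size, miss_latency, tus):
    """Toy model. `stores[i]` is True if store i misses in the L1.

    Returns (stall_cycles, visibility_order).
    """
    sb = deque()          # store buffer: store indices in program order
    pending = deque()     # TUS only: (store index, cycle permission arrives)
    visible = []          # order in which stores become globally visible
    head_ready_at = None  # baseline only: when the blocked head may drain
    next_store = 0
    stalls = 0
    cycle = 0
    while len(visible) < len(stores):
        # 1) Unauthorized stores become visible strictly in program order,
        #    so a ready store still waits behind an older, unready one.
        while pending and pending[0][1] <= cycle:
            visible.append(pending.popleft()[0])
        # 2) Try to drain the store at the head of the SB.
        if sb:
            head = sb[0]
            latency = miss_latency if stores[head] else 0
            if tus:
                # Write to the L1 right away; permission is awaited off the SB.
                sb.popleft()
                pending.append((head, cycle + latency))
            else:
                if head_ready_at is None:
                    head_ready_at = cycle + latency
                if head_ready_at <= cycle:
                    sb.popleft()
                    visible.append(head)
                    head_ready_at = None
        # 3) The core tries to issue one new store; a full SB stalls it.
        if next_store < len(stores):
            if len(sb) < sb_size:
                sb.append(next_store)
                next_store += 1
            else:
                stalls += 1
        cycle += 1
    return stalls, visible


# One long-latency miss followed by seven hits, with a 4-entry SB.
pattern = [True] + [False] * 7
baseline_stalls, baseline_order = simulate(pattern, 4, 20, tus=False)
tus_stalls, tus_order = simulate(pattern, 4, 20, tus=True)
```

In this toy run the baseline stalls once the SB fills behind the missing store, while the TUS-like mode never fills the SB. Both modes make stores visible in program order, which is the property x86-TSO requires.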

Files in this item:

File	Description	Size	Format
jcebrian-micro24.pdf		2.26 MB	Adobe PDF
