Compiler-Assisted Compaction/Restoration of SIMD Instructions

Cebrián González, Juan Manuel; Balem, Thibaud; Barredo, Adrian; Casas, Marc; Moreto, Miquel; Ros, Alberto; Jimborean, Alexandra

Please use this identifier to cite or link to this item: 10.1109/TPDS.2021.3091015


Title:	Compiler-Assisted Compaction/Restoration of SIMD Instructions
Issue date:	Apr-2022
Citation:	IEEE Transactions on Parallel and Distributed Systems (TPDS), 33 (4),
ISSN:	1045-9219
Keywords:	SIMD; Predication; LLVM; Density-time performance
Abstract:	Vector processors (e.g., SIMD or GPUs) are ubiquitous in high performance systems. All the supercomputers in the world exploit data-level parallelism (DLP), for example by using single instructions to operate over several data elements. Improving vector processing is therefore key for exascale computing. However, despite its potential, vector code generation and execution face significant challenges. Among these challenges, control flow divergence is one of the main performance-limiting factors. Most modern vector instruction sets, including SIMD, rely on predication to support divergence control. Nevertheless, the performance and energy consumption of predicated code are usually insensitive to the number of active elements in a predicate mask. As vector register sizes continue to grow, the energy efficiency of exascale computing systems will become sub-optimal. This paper proposes a novel approach to improve execution efficiency in predicated vector code, the Compiler-Assisted Compaction/Restoration (CACR) technique. Baseline Compaction/Restoration (CR) delays predicated SIMD instructions that have inactive elements, compacting active elements from instances of the same instruction in consecutive loop iterations. The compacted elements form an equivalent dense vector instruction. After the dense instructions execute, their results are restored to the original instructions. However, CR incurs a significant performance and energy penalty when it fails to find active elements, either due to a lack of resources when unrolling or because of inter-loop dependencies. In CACR, the compiler analyzes the code looking for key information required to configure CR. It then passes this information to the processor via new instructions inserted in the code. This prevents CR from waiting for active elements in scenarios where it would fail to form dense instructions.
Simulated results (gem5) show that CACR improves performance by up to 29% and reduces dynamic energy by up to 24.2% on average, for a set of applications with predicated execution. The baseline CR only achieves 18.6% performance and 14% energy improvements for the same configuration and applications.
Main author(s):	Cebrián González, Juan Manuel; Balem, Thibaud; Barredo, Adrian; Casas, Marc; Moreto, Miquel; Ros, Alberto; Jimborean, Alexandra
URI:	http://hdl.handle.net/10201/114647
DOI:	10.1109/TPDS.2021.3091015
Document type:	info:eu-repo/semantics/article
Number of pages / Extent:	12
Rights:	info:eu-repo/semantics/openAccess Attribution 4.0 International
Appears in collections:	Articles
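
The compaction and restoration steps described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware mechanism or its LLVM pass: the vector length `VL`, the function names `compact` and `execute_and_restore`, and the merging behaviour for inactive lanes (they keep their previous value) are all hypothetical choices made for illustration. It does not model the compiler-inserted configuration instructions of CACR or any inter-loop dependencies.

```python
from typing import Callable, List, Tuple

VL = 4  # hypothetical vector length (lanes per register)

Lane = Tuple[int, int, int]  # (iteration index, lane index, operand value)


def compact(vectors: List[List[int]], masks: List[List[int]]) -> List[List[Lane]]:
    """Gather the active lanes of consecutive iterations into dense groups of VL."""
    active = [
        (it, lane, vec[lane])
        for it, (vec, mask) in enumerate(zip(vectors, masks))
        for lane in range(len(vec))
        if mask[lane]
    ]
    return [active[i:i + VL] for i in range(0, len(active), VL)]


def execute_and_restore(
    vectors: List[List[int]],
    masks: List[List[int]],
    op: Callable[[int], int],
) -> Tuple[List[List[int]], int]:
    """Run op once per dense group, then scatter results to their origin lanes."""
    results = [list(v) for v in vectors]  # assumption: inactive lanes keep old values
    dense_groups = compact(vectors, masks)
    for group in dense_groups:
        values = [op(x) for _, _, x in group]  # stands in for one dense instruction
        for (it, lane, _), y in zip(group, values):
            results[it][lane] = y  # restoration to the original instruction
    return results, len(dense_groups)


# Four iterations of the same predicated instruction, each sparsely active.
vectors = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
masks = [[1, 0, 0, 0], [0, 1, 0, 1], [0, 0, 0, 0], [1, 0, 0, 0]]

restored, dense_issued = execute_and_restore(vectors, masks, lambda x: x * 10)
sparse_issued = sum(1 for m in masks if any(m))  # instructions with any active lane
```

In this toy input the four active elements spread over three sparse instructions fit into a single dense group, so `dense_issued` is 1 while `sparse_issued` is 3. A real implementation would also have to bound how long an instruction may wait for further active elements, which is the situation the compiler-provided information in CACR is meant to address.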

Files in this item:

File	Description	Size	Format
jcebrian-tpds22.pdf		1.97 MB	Adobe PDF


This item is licensed under a Creative Commons license.