Advancements towards non-speculative concurrent execution of critical sections

Gómez Hernández, Eduardo José

Please use this identifier to cite or link to this item: http://hdl.handle.net/10201/155797


Title:	Advancements towards non-speculative concurrent execution of critical sections
Other titles:	Avances hacia la ejecución concurrente y no especulativa de secciones críticas
Publication date:	10-Jun-2025
Defense / creation date:	3-Jun-2025
Publisher:	Universidad de Murcia
Related subjects:	UDC::0 - Generalities.::00 - Science and knowledge. Research. Culture. Humanities.::004 - Computer science and technology. Computing.
Keywords:	Computer science
Abstract:	Parallel programs require, besides cache orchestration, another mechanism that guarantees synchronization among the threads of the same program. These synchronization mechanisms induce overheads, for example by slowing down certain operations and stalling threads, in order to comply with the requirements established by the programmer. A major issue when evaluating new proposals is the lack of up-to-date and representative baselines. The most widely used ones are benchmark suites, but most of them were crafted over 20 years ago. While some of them try to keep up with the architectural changes in processors, many are left behind. One of the most misrepresented capabilities is synchronization. The code regions that require synchronization are known as critical sections (or atomic regions). Depending on the nature and properties of these sections, different approaches can be used to protect them. Small sections that touch few addresses ("fine-grain") tend to be more efficient, but are much harder to develop and debug, while large sections that touch several addresses ("coarse-grain") are trivial to write, can even be generated automatically, and are less error-prone, but perform much worse. Different methods have been developed over the years to enable high-performance coarse-grain critical sections. Hardware Transactional Memory and Speculative Lock Elision are the best known in this context. The main issue is that they pose many challenges for hardware designers, while it remains unclear whether they are a good approach. The objective of this thesis is the efficient execution of critical sections, that is, regions of code that must be executed atomically. The most efficient method is the concurrent and non-speculative execution of these sections.
To achieve this, we present the three steps we have taken: 1) simple atomic instructions can be used to implement non-speculative critical sections, so we develop an updated version of the well-known Splash benchmark suite that uses single-address atomic instructions to implement most of its critical sections (Splash-4); 2) a new set of multi-address atomic instructions, together with a methodology for implementing them efficiently, that can be used for small critical sections (MADs); 3) a more generic method that, without direct intervention by the programmer, limits the retries required to execute contended critical regions (CLEAR). To evaluate the results, we have used the most up-to-date tools available in each case and, where possible, real machines instead of simulations. For the simulations, we have used the gem5 simulator, always performing multiple runs. The simulator has been configured to emulate, as faithfully as possible, processors based on the latest Intel generations. In our first step, Splash-4, we reduce execution time by 50% when using 64 cores, while maintaining the original structure and algorithms. In the second step (MADs), the new atomic instructions reduce execution time by 80% compared with the classical lock mechanism, and by 60% compared with a transactional memory technique similar to Intel TSX, while adding only 68 bytes per core. Finally, CLEAR is able to limit the number of re-executions of critical sections executed under speculative methods, increasing by 35% the number of sections that complete on the first retry and reducing from 37% to 15% the share of sections that have to reach the fallback path. All of this improves execution time by 35% against an Intel TSX-like implementation and by 23% against PowerTM.
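The first step in the abstract (replacing lock-protected critical sections with single-address atomic instructions) can be illustrated with a minimal C++ sketch. This is not code from Splash-4 or from the thesis: the names `LockedCounter`, `AtomicCounter`, and `run` are hypothetical, and the example assumes a critical section that only updates one memory location, which is the case where a single atomic instruction can stand in for a lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical baseline: the update is a critical section guarded by a lock.
struct LockedCounter {
    std::mutex m;
    std::uint64_t value = 0;

    void add(std::uint64_t x) {
        std::lock_guard<std::mutex> guard(m);  // enter critical section
        value += x;                            // touches a single address
    }                                          // lock released here

    std::uint64_t get() {
        std::lock_guard<std::mutex> guard(m);
        return value;
    }
};

// Hypothetical alternative: the same update as one single-address atomic
// read-modify-write, with no lock and no speculation.
struct AtomicCounter {
    std::atomic<std::uint64_t> value{0};

    void add(std::uint64_t x) {
        // Relaxed ordering is enough here because only the final sum is
        // read, and it is read after all threads have been joined.
        value.fetch_add(x, std::memory_order_relaxed);
    }

    std::uint64_t get() { return value.load(); }
};

// Runs `threads` threads, each adding 1 to a shared counter `iters` times,
// and returns the final value of the counter.
template <typename Counter>
std::uint64_t run(unsigned threads, unsigned iters) {
    assert(threads > 0);
    Counter counter;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iters] {
            for (unsigned i = 0; i < iters; ++i) counter.add(1);
        });
    }
    for (auto& th : pool) th.join();
    return counter.get();
}
```

Calling `run<LockedCounter>(4, 10000)` and `run<AtomicCounter>(4, 10000)` should both yield 40000; the two versions differ only in how the update is made atomic. Critical sections that touch several addresses cannot be rewritten this way, which is the gap the multi-address instructions (MADs) and CLEAR described above are meant to address.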
Main author(s):	Gómez Hernández, Eduardo José
Supervisor(s):	Ros Bardisa, Alberto; Kaxiras, Stefanos
Faculty/Services:	Escuela Internacional de Doctorado
Is part of:	Research project:
URI:	http://hdl.handle.net/10201/155797
Document type:	info:eu-repo/semantics/doctoralThesis
Number of pages / Extent:	121
Rights:	info:eu-repo/semantics/openAccess Attribution-NonCommercial-NoDerivatives 4.0 International
Appears in collections:	Ingeniería

Files in this item:

File	Description	Size	Format
Gomez-Hernandez-EduardoJose_TD_2025_SinArticulos correg.pdf		941.41 kB	Adobe PDF
Gomez-Hernandez-EduardoJose_TD_2025 correg.pdf		2.25 MB	Adobe PDF


This item is subject to a Creative Commons license.