DigitalUM :: Browsing by Subject "GPU"

Open Access
Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs
(ACM, 2023-10-21) Joseph, D.; Aragón, J.L.; Parcerisa, J.M.; González, A.; Ingeniería y Tecnología de Computadores
Work exploiting cache locality in GPUs is plentiful, and most of it explores replacement or bypassing policies. In this paper, we go beyond that exploration by providing a formal proof for a no-overhead, quasi-optimal technique for caching textures in graphics workloads. Textures make up a significant part of main memory traffic in mobile GPUs, which contributes to the total GPU energy consumption. Since texture accesses use a shared L2 cache, improving L2 texture caching efficiency would decrease main memory traffic and thus improve energy efficiency, which is crucial for mobile GPUs. Our proposal reaches quasi-optimality by exploiting the frame-to-frame reuse of textures in graphics. We do this by traversing frames in a boustrophedonic manner with respect to the frame-to-frame tile order. We first approximate the texture access trace by a circular trace and then give a formal proof that our proposal is optimal for such traces. We also complement the proof with empirical data that demonstrates the quasi-optimality of our no-cost proposal.
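The frame-to-frame idea in the abstract above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' implementation: each tile is assumed to touch exactly one distinct texture block, every frame touches the same blocks, the cache is a fully associative LRU cache, and "boustrophedonic" is read as reversing the tile order on every other frame. The names `lru_hits` and `frame_trace` are hypothetical.

```python
from collections import OrderedDict

def lru_hits(trace, capacity):
    """Count hits of a fully associative LRU cache over an access trace."""
    cache = OrderedDict()
    hits = 0
    for item in trace:
        if item in cache:
            hits += 1
            cache.move_to_end(item)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used item
            cache[item] = True
    return hits

def frame_trace(num_tiles, num_frames, boustrophedonic):
    """Build a trace in which tile i touches texture block i (an assumption)."""
    trace = []
    for frame in range(num_frames):
        order = list(range(num_tiles))
        if boustrophedonic and frame % 2 == 1:
            order.reverse()                # odd frames walk the tiles backwards
        trace.extend(order)
    return trace

NUM_TILES, NUM_FRAMES, CAPACITY = 8, 4, 4
same_order_hits = lru_hits(frame_trace(NUM_TILES, NUM_FRAMES, False), CAPACITY)
alternating_hits = lru_hits(frame_trace(NUM_TILES, NUM_FRAMES, True), CAPACITY)
print(same_order_hits, alternating_hits)
```

With a cache that holds half of the per-frame working set, the repeated same-order (circular) trace never hits, while the alternating order reuses the blocks left in the cache at the end of the previous frame.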
Open Access
Characterization of machine learning compilers for LLM inference on NVIDIA GPUs
(Springer, 2026-05-15) Bernabé García, Gregorio; José Manuel García; Carmona Martínez, Alejandro; Ingeniería y Tecnología de Computadores; Facultades de la UMU::Facultad de Informática
AI inference faces a conflict between Performance, developer Productivity, and device Portability, known as the P3 problem. Machine learning compilers (MLCs) aim to address this, but their ecosystem is fragmented, with tools that each prioritize a different issue. This paper evaluates the deployment trade-offs of PyTorch-based LLMs on NVIDIA GPUs using four prominent, intertwined MLC tools: torch.compile, TensorRT, XLA, and ONNX Runtime. A dual methodology is used, leveraging synthetic PyTorch models to isolate optimizations and end-to-end benchmarks with State-of-the-Art (SOTA) models (TinyLlama-1.1B, Llama-2-7B) to measure real-world performance. Findings reveal that the peak performance of Ahead-Of-Time (AOT) compilation requires architecture-specific tools such as TensorRT-LLM, which are necessary for SOTA LLMs but are unusable for PyTorch models. Just-In-Time (JIT) solutions such as torch.compile and its backends are flexible, portable, and compatible with all tested models, but they do not consistently accelerate LLMs; therefore, the choice of MLC depends on P3 considerations and model architecture.
Open Access
Exploiting heterogeneous parallelism on hybrid metaheuristics for vector autoregression models
(MDPI, 2020-10-27) Cuenca Muñoz, Antonio Javier; Cutillas Lozano, José Matías; Giménez, Domingo; Pérez Bernabeu, Alberto; López Espín, José J.; Ingeniería y Tecnología de Computadores
In recent years, the huge amount of data available in many disciplines has made mathematical modeling, and more specifically econometric modeling, a very important technique for explaining those data. One of the most widely used econometric techniques is the Vector Autoregression (VAR) model, a multi-equation model that linearly describes the interactions and behavior of a group of variables by using their past values. Traditionally, Ordinary Least Squares and Maximum Likelihood estimators have been used to estimate VAR models. These techniques are consistent and asymptotically efficient under ideal conditions of the data and the identification problem; otherwise, they yield inconsistent parameter estimates. This paper considers the estimation of a VAR model by minimizing the difference between the dependent variables at a given time and the expression of their own past and the exogenous variables of the model (in this case denoted as a VARX model). The solution of this optimization problem is approached through hybrid metaheuristics. The high computational cost due to the huge amount of data makes it necessary to exploit High-Performance Computing to accelerate the methods used to obtain the models. The parameterized, parallel implementation of the metaheuristics and the matrix formulation ease the simultaneous exploitation of parallelism for groups of hybrid metaheuristics. Multilevel and heterogeneous parallelism are exploited in multicore CPU plus multi-GPU nodes, with the optimum combination of the different parallelism parameters depending on the particular metaheuristic and the problem it is applied to.
Open Access
Griffin: hardware-software support for efficient page migration in Multi-GPU systems
(IEEE, 2020-04-16) Baruah, Trinayan; Sun, Yifan; Dinçer, Ali Tolga; Mojumder, Saiful A.; Abellán Miguel, José Luis; Ukidave, Yash; Joshi, Ajay; Rubin, Norman; Kim, John; Kaeli, David; Ingeniería y Tecnología de Computadores; Facultad de Informática
As transistor scaling becomes increasingly difficult to achieve, scaling the core count on a single GPU chip has also become extremely challenging. As the volume of data to process in today's increasingly parallel workloads continues to grow unbounded, we need to find scalable solutions that can keep up with this increasing demand. To meet the needs of modern-day parallel applications, multi-GPU systems offer a promising path to deliver high performance and large memory capacity. However, multi-GPU systems suffer from performance issues associated with GPU-to-GPU communication and data sharing, which severely impact the benefits of multi-GPU systems. Programming multi-GPU systems has been made considerably simpler with the advent of Unified Memory, which enables runtime migration of pages to the GPU on demand. Current multi-GPU systems rely on a first-touch Demand Paging scheme, where memory pages are migrated from the CPU to the GPU on the first GPU access to a page. The data-sharing nature of GPU applications makes deploying an efficient programmer-transparent mechanism for inter-GPU page migration challenging. Therefore, following the initial CPU-to-GPU page migration, the page is pinned on that GPU. Future accesses to this page from other GPUs happen at cache-line granularity: pages are not transferred between GPUs without significant programmer intervention. We observe that this mechanism suffers from two major drawbacks: 1) imbalance in the page distribution across multiple GPUs, and 2) inability to move the page to the GPU that uses it most frequently. Both of these problems lead to load imbalance across GPUs, degrading the performance of the multi-GPU system. To address these problems, we propose Griffin, a holistic hardware-software solution to improve the performance of NUMA multi-GPU systems.
Griffin introduces programmer-transparent modifications to both the IOMMU and GPU architecture, supporting efficient runtime page migration based on locality information. In particular, Griffin employs a novel mechanism to detect and move pages at runtime between GPUs, increasing the frequency of resolving accesses locally, which in turn improves performance. To ensure better load balancing across GPUs, Griffin employs a Delayed First-Touch Migration policy that ensures pages are evenly distributed across multiple GPUs. Our results on a diverse set of multi-GPU workloads show that Griffin can achieve up to a 2.9× speedup on a multi-GPU system, while incurring low implementation overhead.
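The two drawbacks of first-touch pinning described in this abstract can be made concrete with a small offline comparison. This is a sketch only and does not reproduce Griffin's runtime mechanism: the access trace is invented, the "most frequent" placement uses full knowledge of the trace, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def first_touch_placement(accesses):
    """Pin each page on the GPU that touches it first."""
    placement = {}
    for gpu, page in accesses:
        placement.setdefault(page, gpu)
    return placement

def most_frequent_placement(accesses):
    """Place each page on the GPU that accesses it most (ties: lowest GPU id)."""
    counts = defaultdict(Counter)
    for gpu, page in accesses:
        counts[page][gpu] += 1
    return {page: min(c, key=lambda g: (-c[g], g)) for page, c in counts.items()}

def local_fraction(accesses, placement):
    """Fraction of accesses served by the GPU that holds the page."""
    local = sum(1 for gpu, page in accesses if placement[page] == gpu)
    return local / len(accesses)

# Hypothetical trace of (gpu_id, page_id) pairs; GPU 0 happens to touch every page first.
accesses = [(0, "A"), (1, "A"), (1, "A"), (1, "A"),
            (0, "B"), (0, "B"), (1, "B"),
            (0, "C"), (1, "C"), (1, "C")]

first_touch = first_touch_placement(accesses)
most_frequent = most_frequent_placement(accesses)
print(first_touch, local_fraction(accesses, first_touch))
print(most_frequent, local_fraction(accesses, most_frequent))
```

In this toy trace, first touch pins all three pages on GPU 0 (imbalanced placement) and serves fewer accesses locally than a placement that follows the most frequent user of each page.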
Open Access
Portable device for presbyopia correction with optoelectronic lenses driven by pupil response
(Springer Nature, 2020-11-20) Mompeán, J.; Aragón, J.L.; Artal, P.; Ingeniería y Tecnología de Computadores
A novel portable device has been developed and built to correct presbyopia dynamically and automatically by means of a pair of optoelectronic lenses driven by pupil tracking. The system is completely portable and provides a wide range of defocus correction, up to 10 D. The glasses are controlled and powered by a smartphone. To achieve a truly real-time response, image processing algorithms have been implemented in OpenCL and run on the GPU of the smartphone. To validate the system, different visual experiments were carried out in presbyopic subjects. Visual acuity was maintained nearly constant for a range of distances from 5 m to 20 cm.
Open Access
Spartan: a sparsity-adaptive framework to accelerate deep neural network training on GPUs
(Institute of Electrical and Electronics Engineers, 2021-03-22) Dong, Shi; Sun, Yifan; Agostini, Nicolas Bohm; Karimi, Elmira; Lowell, Daniel; Zhou, Jing; Cano, José; Abellán Miguel, José Luis; Kaeli, David; Ingeniería y Tecnología de Computadores; Facultad de Informática
Deep Neural Networks (DNNs) have emerged as an important class of machine learning algorithms, providing accurate solutions to a broad range of applications. Sparsity in activation maps in DNN training presents an opportunity to reduce computations. However, exploiting activation sparsity presents two major challenges: i) profiling activation sparsity during training comes with significant overhead due to computing the degree of sparsity and the data movement; ii) the dynamic nature of activation maps requires dynamic dense-to-sparse conversion during training, leading to significant overhead. In this article, we present Spartan, a lightweight hardware/software framework to accelerate DNN training on a GPU. Spartan provides a cost-effective and programmer-transparent microarchitectural solution to exploit activation sparsity detected during training. Spartan provides an efficient sparsity monitor, a tile-based sparse GEMM algorithm, and a novel compaction engine designed for GPU workloads. Spartan can reduce sparsity profiling overhead by 52.5× on average. For the most compute-intensive layers, i.e., convolutional layers, we can speed up AlexNet by 3.4×, VGGNet-16 by 2.14×, and ResNet-18 by 2.02×, when training on the ImageNet dataset.
Open Access
TCOR: A Tile Cache with Optimal Replacement
(IEEE, 2022-04-06) Joseph, D.; Aragón, J.L.; Parcerisa, J.M.; González, A.; Ingeniería y Tecnología de Computadores
Cache Replacement Policies are known to have an important impact on hit rates. The OPT replacement policy [27] has been formally proven as optimal for minimizing misses. Due to its need to look far ahead for future memory accesses, it is often reduced to a yardstick for measuring the efficacy of other practical caches. In this paper, we bring the OPT to life, in architectures for mobile GPUs, for which energy efficiency is of great consequence. We also mold other factors in the memory hierarchy to enhance its impact. The end results are a 13.8% decrease in the memory hierarchy energy consumption and an increased throughput in the Tiling Engine. We also observe a 5.5% decrease in the total GPU energy and a 3.7% increase in frames per second (FPS).
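For readers unfamiliar with the OPT policy mentioned in this abstract, the following sketch shows why it needs future knowledge and why it serves as a yardstick. It is a textbook-style illustration of Belady's OPT against LRU on a small, made-up trace, not the TCOR design; the function names and the trace are assumptions.

```python
from collections import OrderedDict

def opt_misses(trace, capacity):
    """Count misses under OPT: evict the block whose next use is furthest away."""
    cache = set()
    misses = 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(b):
                # Index of the next access to b, or infinity if never reused.
                for j in range(i + 1, len(trace)):
                    if trace[j] == b:
                        return j
                return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses

def lru_misses(trace, capacity):
    """Count misses under LRU, which only looks at past accesses."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return misses

trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]  # hypothetical block accesses
print(opt_misses(trace, 3), lru_misses(trace, 3))
```

OPT must scan the remainder of the trace on every eviction, which is the look-ahead requirement that usually confines it to offline comparisons.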
