June 2, 2022 — As part of a series to share best practices in preparing apps for Aurora, Argonne highlights researchers’ efforts to optimize code to run efficiently on GPUs.
Kris Rowe, a computational scientist at the Argonne Leadership Computing Facility (ALCF), is leading efforts to optimize the NekRS computational fluid dynamics solver for deployment on Aurora, the ALCF’s forthcoming 2-exaflop system. The ALCF is a US Department of Energy (DOE) Office of Science User Facility located at Argonne National Laboratory.
Like a number of computational science and engineering applications, NekRS relies on OCCA, a vendor-neutral, open-source framework and library for parallel programming on diverse architectures. OCCA enables transparent generation of raw backend code, as well as just-in-time compilation with kernel caching, and is compatible with SYCL/DPC++.
NekRS makes extensive use of the OCCA framework to achieve performance portability, leveraging the OCCA runtime to manage GPU memory, queue computations on GPUs, and perform just-in-time compilation of math kernels. The runtime translates human-readable source code into machine-executable binary while the application is running. Additionally, all NekRS math kernels are implemented using the OCCA Kernel Language (OKL).
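The just-in-time compilation and kernel caching described above can be sketched conceptually: the runtime keys a cache on a hash of the kernel source and its build options, compiles on the first request, and reuses the compiled object thereafter. The following Python snippet is a toy analogy only, not OCCA’s actual implementation, with all names (`KernelCache`, `build`) invented for illustration:

```python
import hashlib

class KernelCache:
    """Toy analogy of a JIT kernel cache: compile a source string once,
    keyed by a hash of the source and its build options, then reuse it."""

    def __init__(self):
        self._cache = {}

    def build(self, source, entry, options=""):
        key = hashlib.sha256((source + options).encode()).hexdigest()
        if key not in self._cache:                  # compile only on a cache miss
            namespace = {}
            exec(compile(source, "<kernel>", "exec"), namespace)
            self._cache[key] = namespace[entry]
        return self._cache[key]

cache = KernelCache()
src = "def axpy(a, x, y):\n    return [a * xi + yi for xi, yi in zip(x, y)]"
axpy = cache.build(src, "axpy")
print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))   # [12.0, 24.0]
print(cache.build(src, "axpy") is axpy)      # True: second build hits the cache
```

In OCCA itself, the cached artifact is a compiled GPU binary rather than a Python function, but the caching principle — hash the source, compile once, reuse — is the same.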
Developing the DPC++ backend for the OCCA Portability Framework included implementing a source-to-source compiler to translate OKL to DPC++.
- Successively tackle code problems of increasing scale and complexity.
- Use roofline analysis to identify compute-intensive cores and focus optimization efforts.
- Research alternative framework backends to find useful features to integrate.
- Consider replacing rather than revising certain algorithms to save time and improve performance.
- Select sample inputs that are representative of the science problems the app will be used for.
Changes between Nek versions
NekRS began as an offshoot of libParanumal, an experimental set of finite element flow solvers developed at Virginia Tech. Unlike its Fortran-based predecessor, NekRS is written primarily in C++.
While NekRS provides a user interface similar to that of the computational fluid dynamics solver Nek5000, its construction represents a major departure from its predecessor. The NekRS developers, however, have designed their code with significant backward compatibility, allowing users to port their existing case files with minimal effort.
Nek5000 was originally designed for strong scaling, i.e., to minimize the time to solution for a given problem. NekRS is likewise designed for strong scaling, but at a much greater magnitude: exascale computing power will enable much larger multiscale and multiphysics simulations than were possible with previous generations of high-performance computing (HPC) systems.
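To make the strong-scaling goal concrete: for a fixed problem size, any inherently serial fraction of the work caps the achievable speedup, a relationship captured by Amdahl’s law. A minimal sketch with illustrative numbers (not NekRS measurements):

```python
def strong_scaling_speedup(serial_frac, nprocs):
    """Amdahl's law: speedup for a fixed-size problem when a fraction
    of the work is inherently serial and the rest parallelizes perfectly."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / nprocs)

# With just 1% serial work, speedup saturates far below the processor count,
# which is why minimizing serial bottlenecks matters at exascale.
print(round(strong_scaling_speedup(0.01, 100), 1))    # 50.3
print(round(strong_scaling_speedup(0.01, 10000), 1))  # 99.0
```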
The Aurora-optimized version of NekRS should differ only slightly from other iterations of the NekRS source code. Because differences in compiler code generation and between GPU microarchitectures can have a significant impact on performance, code customization is concentrated in the most performance-critical compute kernels. The ALCF maintains a fork of NekRS (a copy of the source code) on its GitHub page to improve user accessibility and to support further development and hardware-specific optimization.
Optimizing NekRS for Aurora
The optimization strategy for NekRS, as with many HPC scientific codes, is to climb a ladder of increasingly complex problems. In other words, developers start with small problems, often tailored to a single compute node or server, before tackling problems of progressively larger scale.
A single scientific code can exhibit a wide range of performance characteristics depending on how it is used or the problem to which it is applied. Because of these variations in behavior, it is important that performance is analyzed using sample inputs that are representative of the challenges the code will be used to solve.
The ExaSMR effort supported by the ECP provides an illustrative example. ExaSMR aims to model the entire core of a small modular reactor, a simulation far too large to run on a single node. The NekRS developers, however, provided Rowe with a simulation of a single reactor rod as a representative test problem that fits on a single node.
This test case, combined with Intel’s VTune and Advisor performance tools, allowed Rowe to identify the steps in the NekRS solution process that consume the most time. As is the case with many other scientific codes, a small number of math kernels account for the majority of each GPU’s compute time. Insights like this, which can narrow the code targeted for optimization from tens of thousands of lines to just a few hundred, are critical to accelerating optimization efforts.
Once the critical kernels have been identified, roofline analysis (as performed with Intel Advisor) quantifies the gap between actual measured performance and the theoretical peak of the hardware. Additionally, the information gathered through roofline analysis can help determine which optimization techniques are most appropriate for a given situation, or even whether an entirely new algorithm should be considered.
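The roofline model bounds a kernel’s attainable performance by the lesser of the machine’s peak compute rate and its memory bandwidth multiplied by the kernel’s arithmetic intensity (flops per byte moved). A minimal sketch, using illustrative hardware numbers rather than Aurora’s actual specifications:

```python
def roofline(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable GFLOP/s under the roofline model:
    min(peak compute, memory bandwidth * flops-per-byte)."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

# Illustrative numbers only; a real analysis would use measured values
# from a tool such as Intel Advisor.
PEAK, BW = 10000.0, 1000.0           # 10 TFLOP/s peak, 1 TB/s bandwidth

ridge = PEAK / BW                    # intensity where the two bounds meet
low   = roofline(PEAK, BW, 0.5)      # memory-bound kernel: 500 GFLOP/s
high  = roofline(PEAK, BW, 50.0)     # compute-bound kernel: 10000 GFLOP/s
print(ridge, low, high)
```

Kernels left of the ridge point are limited by memory traffic (so the remedy is reducing data movement), while kernels right of it are limited by compute throughput.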
Such considerations of new algorithms have driven what Rowe considers some of his most interesting recent work. In one case, the researchers added an algorithm that exploits an algebraic structure known as a tensor product. The tensor-product structure of the discretization methods used by NekRS makes it possible to compute a given kernel in several mathematically equivalent ways. For example, data for a single mesh element (subdomain) can be represented either as a large vector or as a cube of numbers. Each representation offers different trade-offs depending on the kernel, the problem size, and the characteristics of the hardware.
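The equivalence of the two representations can be illustrated with a small sketch (2D rather than 3D for brevity, and in pure Python rather than NekRS code): applying a 1D operator along one axis of the element’s data array gives the same result as multiplying the flattened vector by the corresponding Kronecker-product matrix, but with far less work.

```python
def kron(A, B):
    """Kronecker product of two dense matrices (lists of lists)."""
    ra, ca, rb, cb = len(A), len(A[0]), len(B), len(B[0])
    return [[A[i // rb][j // cb] * B[i % rb][j % cb]
             for j in range(ca * cb)] for i in range(ra * rb)]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

p = 2
D = [[2.0, 1.0],
     [0.0, 3.0]]             # a small 1D operator (e.g., a derivative matrix)
I = [[1.0, 0.0],
     [0.0, 1.0]]
u = [[1.0, 2.0],
     [3.0, 4.0]]             # element data as a p-by-p "cube" (2D for brevity)

# View 1: big-vector form -- flatten u row-major, apply kron(D, I): O(p^4) work
u_flat = [u[m][j] for m in range(p) for j in range(p)]
v_flat = matvec(kron(D, I), u_flat)

# View 2: tensor form -- apply D along the first axis only: O(p^3) work
v = [[sum(D[i][m] * u[m][j] for m in range(p)) for j in range(p)]
     for i in range(p)]

assert v_flat == [v[i][j] for i in range(p) for j in range(p)]
print(v)   # [[5.0, 8.0], [9.0, 12.0]]
```

Which view wins in practice depends on the kernel, the polynomial order, and the target hardware, which is why having both available matters for performance portability.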
Once NekRS has been optimized for a single Aurora compute node, Rowe will use a sequence of larger test problems to optimize multi-node performance. Particular attention will be paid to the MPI (message-passing interface) communication patterns exercised by the application. The final step will be to analyze the performance of NekRS as applied to the full-scale ExaSMR reactor core problem running on the full Aurora system.
Keeping OCCA up to date
The OCCA framework supports kernels written in OKL in addition to those written in backend-specific code, such as SYCL. The primary goal in optimizing NekRS kernels is to push performance as far as OKL allows; OKL kernels were chosen because they are immediately portable to other vendors’ architectures.
However, in some cases additional performance improvements can only be achieved by using newer features of the SYCL programming model for which no equivalent exists in OCCA. In such cases, developers study the programming models of alternative OCCA backends, such as CUDA, HIP, and OpenMP, for analogous functionality. Where analogues exist, the functionality can be incorporated into the OCCA framework in a unified way.
This illustrates one of OCCA’s strengths: its relative agility allows developers to integrate cutting-edge developments as they happen.
To assess OCCA user needs and better coordinate development, Rowe led efforts to establish OCCA’s Technical Advisory Forum (TAF). Holding open meetings every two weeks, the OCCA TAF brings together stakeholders from the DOE, US universities, Intel, AMD, Shell, and CCG, among others.
Source: Nils Heinonen, Argonne Leadership Computing Facility