
Harnessing a supercomputer for ATLAS

2 June 2022 | By

The ATLAS Collaboration uses a global network of data centres – the Worldwide LHC Computing Grid – to perform data processing and analysis. These data centres are generally built from commodity hardware to run the whole spectrum of ATLAS data crunching, from reducing the raw data coming out of the detector down to a manageable size to producing plots for publication.

While the Grid’s distributed approach has proven very successful, ATLAS researchers are also exploring the potential of High Performance Computing (HPC) centres. HPC harnesses the power of purpose-built supercomputers constructed from specialised hardware, and is used widely in other scientific disciplines.

However, HPC poses significant challenges for ATLAS data processing. First, access to supercomputers is usually strictly limited, with connections to HPC computing nodes heavily restricted or even non-existent. Second, the CPU architecture may not be suitable for ATLAS software, and the installation of any required local software may be tightly controlled. Third, the system may only allow very large jobs using many thousands of nodes, which is atypical of an ATLAS workflow. Finally, the HPC may be geographically distant from the storage hosting ATLAS data, which may pose network problems.

Figure 1: Andrej Filipčič (left) and Jan Jona Javoršek (right) from the Jožef Stefan Institute in Ljubljana, Slovenia, next to Vega. (Image: B. Zebec/Izum)

Despite these challenges, ATLAS collaborators have successfully exploited HPCs over the last few years, including several near the top of the famous Top500 list of supercomputers. Technological barriers were overcome by isolating the main computation from the parts requiring network access, such as data transfer. Software issues were resolved through the use of container technology, which allows ATLAS software to run on any operating system, and the development of “edge services”, which enable computations to run in an offline mode without the need to contact external services.
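The separation of computation from network access described above can be pictured as a three-phase job. The following is only a toy sketch, not ATLAS code: the function names `stage_in`, `compute` and `stage_out` are hypothetical, local directories stand in for remote storage and HPC scratch space, and the "computation" is just a sum.

```python
import shutil
import tempfile
from pathlib import Path


def stage_in(remote_file: Path, scratch: Path) -> Path:
    """Would run where network access is allowed: copy input to scratch."""
    local = scratch / remote_file.name
    shutil.copy(remote_file, local)
    return local


def compute(local_input: Path, scratch: Path) -> Path:
    """Would run on a compute node: reads and writes local scratch only."""
    numbers = [int(token) for token in local_input.read_text().split()]
    output = scratch / "result.txt"
    output.write_text(str(sum(numbers)))
    return output


def stage_out(local_output: Path, remote_dir: Path) -> Path:
    """Would run where network access is allowed: copy output to storage."""
    return Path(shutil.copy(local_output, remote_dir / local_output.name))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-ins for remote storage and HPC scratch space.
    remote = Path(tmp) / "remote"
    scratch = Path(tmp) / "scratch"
    remote.mkdir()
    scratch.mkdir()
    (remote / "input.txt").write_text("1 2 3 4")

    local_input = stage_in(remote / "input.txt", scratch)
    local_output = compute(local_input, scratch)   # no network needed here
    final = stage_out(local_output, remote)
    result = final.read_text()

print(result)
```

Only the first and last phases touch the stand-in "remote" location; the middle phase works purely on local files, which is the property that lets it run on nodes without outside connectivity.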

The most recent HPC to process ATLAS data is Vega, the first new petascale EuroHPC JU machine, hosted at the Institute of Information Science in Maribor, Slovenia (see Figure 1). Vega started operation in April 2021 and consists of 960 nodes, each of which contains 128 physical CPU cores, for a total of 122,880 physical or 245,760 logical cores. To put this in perspective, the total number of cores provided to ATLAS from Grid resources is around 300,000.
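The scale of Vega can be checked with a few lines of arithmetic. This is a minimal sketch using the node and core figures from the text; the factor of two hardware threads per physical core is an assumption inferred from the ratio of logical to physical cores, and the Grid figure is the approximate value quoted above.

```python
# Figures taken from the article.
NODES = 960
PHYSICAL_CORES_PER_NODE = 128
GRID_CORES = 300_000  # approximate, as quoted in the text

# Assumption: two hardware threads per physical core (not stated explicitly).
THREADS_PER_CORE = 2

physical_cores = NODES * PHYSICAL_CORES_PER_NODE
logical_cores = physical_cores * THREADS_PER_CORE
fraction_of_grid = logical_cores / GRID_CORES

print(f"physical cores: {physical_cores:,}")
print(f"logical cores:  {logical_cores:,}")
print(f"Vega logical cores / Grid cores: {fraction_of_grid:.2f}")
```

Under these assumptions, Vega's logical cores come to roughly four fifths of the Grid total.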


Due to close connections with the community of ATLAS physicists in Slovenia, some of whom were heavily involved in the design and commissioning of Vega, the ATLAS Collaboration was one of the first users to be granted official time allocations. This was to the benefit of both the ATLAS Collaboration, who could take advantage of a significant extra resource, and Vega, which was supplied with a steady, well-understood stream of jobs to assist in the commissioning phase.

As seen in Figure 2, Vega was almost continually full of ATLAS jobs from the moment it was turned on; the periods with fewer running jobs are due either to other users on Vega or to a lack of ATLAS jobs to submit. This huge additional computing power – essentially doubling ATLAS’ available resources – was invaluable, allowing several large-scale data-processing campaigns to run in parallel. As a result, the ATLAS Collaboration heads towards the restart of the LHC with a fully refreshed Run-2 dataset and corresponding simulations, many of which have been significantly extended in statistics thanks to the additional resources provided by Vega.

Figure 2: Number of Vega CPU cores occupied by ATLAS from April 2021 to April 2022, with the different colours showing different types of data processing. (Image: ATLAS Collaboration/CERN)

It is a testament to the robustness of ATLAS’ distributed computing systems that they could be scaled up to a single site equivalent in size to the entire Grid. While Vega will eventually be given over to other science projects, some fraction will continue to be dedicated to ATLAS. Further, the successful experience shows that ATLAS members (and their data) are ready to jump on the next available HPC and fully exploit its potential!