The trouble with terabytes

How ATLAS managed this incredible year of data-taking

14 December 2016 | By

Explore the CERN Computing Centre, home of the Worldwide LHC Computing Grid.

2016 has been a record-breaking year. The LHC surpassed its design luminosity and produced stable beams a staggering 60% of the time – up from 40% in previous years, and even surpassing the hoped-for 50% threshold.

While all of the ATLAS collaboration rejoiced – eager to analyse the vast outpouring of data from the experiment – its computing experts had their work cut out for them: “2016 has been quite a challenge,” says Armin Nairz, leader of the ATLAS Tier-0 operations team. Nairz's team is in charge of processing and storing ATLAS data in preparation for distribution to physicists around the world – a task that proved unusually complex this year. “We were well prepared for a big peak in efficiency, but even we did not expect such excellent operation!”

“Data-taking conditions are constantly changing,” says Nairz. “From the detector alignment to the LHC beam parameters, there is never a ‘standard’ set of conditions. One of our key roles is to process this information and provide it along with the main event data.” This job, called the 'calibration loop', can take up to 48 hours. Countless teams verify and re-verify the calibrations before they are applied in subsequent bulk reconstruction of the physics data.

Before 2016, the Tier-0 team would have a 10 to 12 hour break between each LHC beam fill. This gave their servers some breathing room to catch up with demand. “In the weeks leading up to the ICHEP conference, the LHC was working almost too perfectly,” says Nairz. “At one point, it operated at 80% efficiency. This meant there were very short breaks between runs; just 2 hours between a beam dump and the next fill.”

The CERN IT department provided an extra 1000 cores to help the ATLAS team cope with ever-growing demand. However, it soon became clear that this would not be enough: “We had to come up with a new strategy,” explains Nairz. “We needed a way to grow Tier-0 without relying on more computers on-site.” Their solution: outsource the data reconstruction to the Worldwide LHC Computing Grid.

To accomplish this feat, Nairz's Tier-0 team joined forces with the ATLAS Distributed Computing group and the Grid Production team. “Together, we had to train the Grid to process data with a Tier-0 configuration in the much-needed short time scale,” says Nairz. “We experimented with lots of different configurations, trying to steer the jobs to the most appropriate sites (i.e. those with the best, quickest machines).”

This was quite an arduous task for an already-busy team, though it proved very effective. “Despite overwhelming demand during ICHEP, we were able to shepherd copious amounts of data into physics results,” says Nairz. “In the end, the data presented at the conference was just 2 weeks old!”

The Tier-0 team will be ready should such a situation arise again. “Although this solution took enormous effort, it was ultimately successful,” concludes Nairz. “However, ATLAS computing management are now preparing to add new computing resources in 2017, in the hopes of avoiding a similar situation. We have also used this experience to help improve our reconstruction software and workflow, bettering our performance as the year went on.” After all, an experiment is only as valuable as the data it collects!

About the Grid

The Worldwide LHC Computing Grid is a global collaboration of computer centres. It is composed of four levels, or “Tiers”. Each Tier is made up of several computer centres and provides a specific set of services. Between them, the Tiers process, store and analyse all the data from the Large Hadron Collider.

ATLAS Tier-0, located at the CERN data centre, has about 800 machines, with approximately 12,000 processing cores. This allows 12,000 jobs to run in parallel, and up to 100,000 jobs are run per day. During data-taking, the ATLAS online data-acquisition system transfers data to Tier-0 at about 2 GB/s, with peaks of 7 GB/s.
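
To get a feel for these numbers, here is a rough back-of-envelope sketch in Python. The inputs are the figures quoted in this article; everything else is an assumption made purely for illustration: that all 12,000 cores are busy around the clock, that the 2 GB/s average applies only while the LHC delivers stable beams, that stable beams fill 60% of a day, and that 1 TB means 1000 GB. The function names are hypothetical and do not correspond to any ATLAS software.

```python
# Back-of-envelope estimates from the figures quoted in the article.
# These are illustrative assumptions, not official ATLAS accounting.

CORES = 12_000                # approximate Tier-0 processing cores
JOBS_PER_DAY = 100_000        # upper figure for jobs run per day
AVERAGE_RATE_GB_S = 2.0       # typical transfer rate into Tier-0
PEAK_RATE_GB_S = 7.0          # peak transfer rate into Tier-0
STABLE_BEAM_FRACTION = 0.60   # 2016 stable-beam fraction (assumed per day)
SECONDS_PER_DAY = 24 * 60 * 60


def jobs_per_core_per_day(jobs_per_day: float, cores: float) -> float:
    """Average number of jobs each core handles in one day."""
    return jobs_per_day / cores


def mean_job_hours(jobs_per_day: float, cores: float) -> float:
    """Mean job length in hours, assuming every core is busy all day."""
    return 24 * cores / jobs_per_day


def daily_volume_tb(rate_gb_s: float, live_fraction: float) -> float:
    """Data received in one day, in TB (assuming 1 TB = 1000 GB)."""
    return rate_gb_s * SECONDS_PER_DAY * live_fraction / 1000


if __name__ == "__main__":
    print(f"Jobs per core per day: {jobs_per_core_per_day(JOBS_PER_DAY, CORES):.1f}")
    print(f"Mean job length: {mean_job_hours(JOBS_PER_DAY, CORES):.1f} hours")
    typical = daily_volume_tb(AVERAGE_RATE_GB_S, STABLE_BEAM_FRACTION)
    print(f"Typical daily volume: {typical:.0f} TB")
    ceiling = daily_volume_tb(PEAK_RATE_GB_S, 1.0)
    print(f"Ceiling at sustained peak rate: {ceiling:.0f} TB")
```

Under these assumptions, each core works through roughly eight jobs a day, a job lasts a little under three hours on average, and Tier-0 takes in on the order of 100 TB per day of data-taking. The true figures depend on how job lengths and beam time actually vary, which the article does not specify.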