Computer engineers at CERN today announced that the CERN Data Centre has recorded over 100 petabytes of physics data over the last 20 years. Collisions in the Large Hadron Collider (LHC) generated about 75 petabytes of this data in the past three years.
One hundred petabytes (equal to 100 million gigabytes) is a very large amount of data indeed – roughly equivalent to 700 years of full HD-quality movies. Storing it is a challenge. At CERN, the bulk of the data (about 88 petabytes) is archived on tape using the CERN Advanced Storage system (CASTOR) and the rest (13 petabytes) is stored on the EOS disk pool system – a system optimized for fast analysis access by many concurrent users.
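As a rough sanity check on those figures, the conversion can be sketched in a few lines of Python. The sketch assumes decimal (SI) units and 365.25-day years, neither of which is stated in the announcement, and the variable names are purely illustrative.

```python
# Back-of-the-envelope check of the quoted figures.
# Assumptions (not from the announcement): decimal SI units, 365.25-day years.
PETABYTE = 10**15  # bytes
GIGABYTE = 10**9   # bytes

total_bytes = 100 * PETABYTE
total_gigabytes = total_bytes // GIGABYTE  # 100 million gigabytes

# Bitrate implied by spreading 100 PB over 700 years of continuous video.
seconds_of_video = 700 * 365.25 * 24 * 3600
implied_mbit_per_s = total_bytes * 8 / seconds_of_video / 10**6

print(f"{total_gigabytes:,} GB")
print(f"{implied_mbit_per_s:.1f} Mbit/s")
```

Under these assumptions the comparison implies a video stream of roughly 36 megabits per second, a plausible figure for high-quality HD video.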
"We have eight robotic tape libraries distributed over two buildings, and each tape library can contain up to 14,000 tape cartridges," says German Cancio Melia of the CERN IT department. "We currently have around 52,000 tape cartridges with a capacity ranging from one terabyte to 5.5 terabytes each. For the EOS system, the data are stored on over 17,000 disks attached to 800 disk servers." The video below gives a feel for the size - and sounds - of the centre.
(Video: CERN)
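The hardware figures quoted by Cancio Melia allow a quick consistency check against the 88 petabytes held on tape. The sketch below takes the rounded numbers from the quote at face value ("around 52,000" cartridges, "over 17,000" disks) and assumes 1 petabyte = 1,000 terabytes; the results are rough bounds, not official capacity figures.

```python
# Rough bounds derived from the rounded figures in the quote.
# Assumption (not from the announcement): 1 PB = 1000 TB.
TB_PER_PB = 1000

libraries = 8
slots_per_library = 14_000      # "up to 14,000 tape cartridges" each
cartridges = 52_000             # "around 52,000"
min_tb, max_tb = 1.0, 5.5       # capacity range per cartridge
tape_data_pb = 88               # data archived on tape
disks = 17_000                  # "over 17,000"
disk_servers = 800

total_slots = libraries * slots_per_library
slot_occupancy = cartridges / total_slots

# Lower and upper bounds on raw tape capacity, in petabytes.
capacity_low_pb = cartridges * min_tb / TB_PER_PB
capacity_high_pb = cartridges * max_tb / TB_PER_PB

# Average amount of archived data per cartridge, in terabytes.
avg_tb_per_cartridge = tape_data_pb * TB_PER_PB / cartridges

disks_per_server = disks / disk_servers

print(f"slots: {total_slots:,} ({slot_occupancy:.0%} occupied)")
print(f"tape capacity: {capacity_low_pb:.0f} to {capacity_high_pb:.0f} PB")
print(f"average per cartridge: {avg_tb_per_cartridge:.2f} TB")
print(f"disks per server: {disks_per_server:.2f}")
```

On these numbers the libraries are a little under half full, the 88 petabytes on tape fall between the 52 and 286 petabyte bounds, and each disk server hosts about 21 disks.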
Not all the information was generated by LHC experiments. "CERN IT hosts the data of many other high-energy-physics experiments at CERN, past and current, as well as a data centre for the AMS experiment," says Dirk Duellmann of the IT department.
"For both tape and disk, providing efficient data storage and access is very important," says Duellmann, "and this involves identifying performance bottlenecks and understanding how users want to access the data."
Tapes are checked regularly to make sure they stay in good condition and are accessible to users. To optimize storage space, the complete archive is regularly migrated to the newest high-capacity tapes. Disk-based systems are replicated automatically after hard-disk failures, and a scalable namespace enables fast concurrent access to millions of individual files.
The Data Centre will keep busy during the Long Shutdown of the whole accelerator complex, analysing data taken during the LHC's first three-year run and preparing for the higher data flow expected when the upgraded accelerators and experiments start up again. An extension of the Centre and the use of a remote data centre in Hungary will further increase the Data Centre's capacity.
Expect further petabytes.