One of the most important discoveries of this decade was the Higgs boson. But researchers at high energy physics and nuclear physics laboratories and institutes would have been unable to find it without the IT staff who maintain the computing infrastructure that collects and analyzes the massive amounts of data generated by their experiments. HEPiX is a community that brings these IT professionals together twice a year from around the world. This spring, their event was hosted by the Wigner Research Centre for Physics in Budapest, which also plays a central role in CERN's IT infrastructure.
I was invited to HEPiX by Fabien Wernli, who works at CCIN2P3 in France, monitoring thousands of computers using syslog-ng. The syslog-ng application is developed here in Budapest, the city that hosted the spring HEPiX workshop. Having left the academic world over a decade ago, I really enjoyed talking with and listening to IT professionals working at academic institutions.
The CERN IT infrastructure
While not all HEPiX members work on data originating from CERN and the Large Hadron Collider (LHC), the heart of HEPiX seems to be CERN and the software tools used or developed there. Sites working on CERN data are organized into a tiered structure. All data from the experiments are collected, stored and processed at CERN, the Tier-0 site. Parts of the data are forwarded to Tier-1 data centers, where they are processed further. Continuing down the pyramid, Tier-2 and Tier-3 sites download data from the Tier-1 centers and do the actual analysis.
As I mentioned, the Wigner Research Centre for Physics in Budapest now plays a special role in the life of CERN: since 2012, the Wigner Data Center has hosted an extension of CERN's Tier-0 data center. This is possible thanks to advances in networking: CERN and the Wigner DC are connected by three independent 100 Gbit/s lines. In other words, this network can transfer the content of almost eight single-layer DVDs every second.
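The DVD comparison is easy to check. The Python snippet below converts the combined link capacity into DVDs per second. The disc size is our assumption, not a figure from the talk: a single-layer DVD holding 4.7 GB (decimal gigabytes). Protocol overhead is ignored.

```python
# Back-of-the-envelope check of the CERN-Wigner link capacity.
# Assumptions (ours): single-layer DVDs of 4.7 GB (10**9 bytes each)
# and no protocol overhead on the links.

LINKS = 3
LINK_GBIT_PER_S = 100      # each line carries 100 Gbit/s
DVD_GB = 4.7               # assumed single-layer DVD capacity in GB

total_gbit_per_s = LINKS * LINK_GBIT_PER_S   # 300 Gbit/s in total
total_gbyte_per_s = total_gbit_per_s / 8     # 8 bits per byte -> 37.5 GB/s
dvds_per_second = total_gbyte_per_s / DVD_GB

print(f"{total_gbyte_per_s} GB/s = {dvds_per_second:.1f} DVDs per second")
```

With these assumptions it prints `37.5 GB/s = 8.0 DVDs per second`, which is just under eight discs every second. With dual-layer (8.5 GB) discs, the figure would drop to roughly 4.4.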
Maintaining this infrastructure requires an enormous amount of resources and work. It needs to be available around the clock and be fast and efficient, while changing only gradually. Many of the conference topics addressed how these often contradictory requirements can be met.
The opening day of the HEPiX spring workshop focused on site reports describing new hardware and services, as well as some of the research done at the sites since the last meeting. The rest of the week covered topics related to large-scale computing: storage, networking and virtualization. My favorite topics at the conference were security and basic IT services, as these relate to my field of interest: logging.
Logging came up in a number of talks. There were many Elasticsearch instances running at CERN and elsewhere. At CERN, these were recently consolidated under central management, and we learned how many of the problems were resolved by introducing access control and regular maintenance. We also received a quick introduction to how sites and infrastructures collaborate on security through a Security Operations Center. Last but not least, I gave an introductory talk about syslog-ng, and Fabien Wernli presented how they use syslog-ng to monitor tens of thousands of machines at CCIN2P3, a Tier-1 site in France. During the conference I had a chance to talk to him as well.
Fabien Wernli and syslog-ng
We learned at HEPiX that CCIN2P3 provides important services to CERN as a Tier-1 site. What else is it working on?
We are a computing facility inside IN2P3. IN2P3 is one of the institutes of the French National Center for Scientific Research (French: Centre national de la recherche scientifique, CNRS). It groups all the scientists and staff who work on nuclear physics and particle physics. Our facility provides computing resources for all these labs. We work with a lot of different scientists, so we need computing power, storage and network. Over 85% of our resources are used by the LHC, because its experiments are so huge that they need a lot of data processing power. There are many smaller experiments as well. One that is currently growing and will generate a lot of data is LSST, the Large Synoptic Survey Telescope. It will take a picture of the whole sky every night, generating 150 TB of data each time. It is not as much as the LHC, but quite a lot. As for the LHC, our facility will be one of the main tiers for this experiment.
I see you have a PhD in astrophysics. Why did you become a Linux administrator?
When you get a PhD, you do not become an expert in anything other than learning how to learn. Astrophysics is something I was interested in for a long time, and the other thing I was interested in is computing. I have been a computer freak since I was a kid, and this path was more promising for a career. It was also easier to find a job without having to travel the whole planet all the time. When you have a family, you want to settle somewhere. I love computing, and it was a good opportunity. When I worked at the observatory in Lyon, where I did my PhD, I also did a lot of Linux administration. There were only one or two people there doing Linux administration, but they did not administer the desktops. We were on our own, so I improved my Linux skills a lot.
And with this new LSST research, you can get back to astrophysics, at least partially.
That is the good thing about IN2P3 and CCIN2P3: we do our job for science, not to make money or any financial profit. I prefer that to industry, where ultimately you have to make money.
What are you doing at CCIN2P3?
My main function is system administration. Together with my colleagues, we are ten admins, and my specialty is monitoring. All things monitoring: metrics, logs, analysis or anything related.
How did you first come across syslog-ng? Why did you decide to use it?
When I arrived at CCIN2P3, there was already a central syslog server, and it was running syslog-ng. A very old version, 2 or something, I think. When I had to design a new system to replace it, I looked around, and syslog-ng looked the most promising, mainly for three reasons. The first one was documentation, which was great compared to competitors. It was in-depth and versioned, so I could look up documentation even for an old version. And the configuration examples you copied and pasted actually worked. The second is that it is portable. At that time we had Solaris, AIX and Linux, and it would compile, or was available as a package, almost everywhere. The community was the third reason I chose it. The community is very friendly. There were people on IRC at that time, and the mailing list is helpful and a very good resource as well.
You have made many contributions to syslog-ng. Which are you most proud of?
Maybe I have made many, but they are small ones. The one I am probably most proud of is the latest one, the HTTPS destination to Elasticsearch. And maybe the many issues I have opened. I am even prouder that the issues I opened actually get addressed, so my powers of persuasion seem to be OK.
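The interview does not go into how the HTTPS destination works internally. As general background, and as an assumption rather than a description of Fabien's code: a log shipper writing to Elasticsearch over HTTP(S) typically uses the bulk API, which accepts batches of documents as newline-delimited JSON (NDJSON). Each document is preceded by an action line naming the target index, and the whole body ends with a newline. The Python sketch below only builds such a body. The function name `build_bulk_body`, the index name `syslog-ng` and the record fields are hypothetical, and the actual HTTPS POST to the `_bulk` endpoint (TLS, authentication, retries) is left out.

```python
import json


def build_bulk_body(messages, index):
    """Build an NDJSON request body for Elasticsearch's _bulk endpoint.

    Hypothetical helper for illustration: each document is preceded by an
    "index" action line naming the target index, and every line, including
    the last one, is terminated by a newline as the bulk API requires.
    """
    lines = []
    for msg in messages:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(msg))                            # document line
    # Terminate every line with "\n"; an empty batch yields an empty body.
    return "".join(line + "\n" for line in lines)


# Hypothetical log records, roughly what parsed syslog messages could map to.
records = [
    {"host": "node001", "program": "sshd", "message": "Accepted publickey for admin"},
    {"host": "node002", "program": "kernel", "message": "eth0: link up"},
]

body = build_bulk_body(records, index="syslog-ng")
print(body, end="")
```

Building one body per batch is what makes this approach efficient for logging. Many messages travel in a single HTTP request instead of one request each, and the body would be sent with the `application/x-ndjson` content type.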