GridPP Technical Meeting
Virtual Only
Fortnightly meeting for technical topics looking further ahead than the weekly ops meetings on Tuesdays. There is also a dedicated storage group meeting on Wednesdays. Each topic can go beyond the nominal 5 minute slot whenever necessary.
Notes from GridPP technical meeting 19/1/2018
Spectre/Meltdown
There is a summary of benchmarking at:
https://indico.cern.ch/event/671507/
and these results will be summarised at the next GDB. Others (e.g. Rob Currie) are investigating other file systems etc.
Martin Bly showed some of his results as a spreadsheet. The differences for HEPSpec were very small (less than 1%). There remains the question of how well HEPSpec represents the true workflows that we run; however, it is the benchmark against which our commitments to WLCG are measured, so pledges will be unchanged.
Martin also showed results for write- and read-heavy tests. These vary widely; this is work in progress.
We really need some standard candle jobs. Perhaps HammerCloud tests, which currently show ~4% longer run times.
HPC benchmarks can be seen in: https://arxiv.org/abs/1801.04329
There will be a benchmarking working group meeting a week today, which will have updates.
Red Hat has reverted the update for Spectre variant 2, as it may cause instabilities in higher-end servers (this last part is rumour). They are looking to rely on Intel to have the fix pushed through the manufacturers' BIOS updates (see https://security-center.intel.com/advisory.aspx?intelid=INTEL-SA-00088&languageid=en-fr for Intel's latest).
If you find anything useful then please share it so that others don't have to waste their time going through everything that is out there.
There is no date for a new kernel that will solve this yet.
Useful links include:
For supermicro:
https://www.supermicro.com/support/security_Intel-SA-00088.cfm?pg=X11#tab
For Dell (changes daily):
https://www.dell.com/support/article/uk/en/ukbsdt1/sln308588/microprocessor-side-channel-vulnerabilities-cve-2017-5715-cve-2017-5753-cve-2017-5754-impact-on-dell-emc-products-dell-enterprise-servers-storage-and-networking-?lang=en
Also the Red Hat page at:
https://access.redhat.com/articles/3311301
David Crooks is writing this up and will share if it is useful.
Generally useful is the EGI link at:
https://wiki.egi.eu/wiki/SVG:Meltdown_and_Spectre_Vulnerabilities
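On kernels that do carry the patches, per-vulnerability mitigation status is exposed under sysfs, which gives a quick first check before digging into the vendor pages above. A minimal sketch (the sysfs path is standard on recent kernels but absent on unpatched ones, so the script degrades gracefully):

```shell
#!/bin/sh
# Print the kernel's reported status for each known CPU vulnerability.
# On kernels without the Meltdown/Spectre patches the directory does
# not exist, in which case we say so rather than fail.
vulndir=/sys/devices/system/cpu/vulnerabilities
if [ -d "$vulndir" ]; then
    for f in "$vulndir"/*; do
        printf '%s: %s\n' "$(basename "$f")" "$(cat "$f")"
    done
else
    echo "No $vulndir - kernel predates the mitigation reporting"
fi
```

Typical patched output includes lines such as `meltdown: Mitigation: PTI`; anything reporting `Vulnerable` needs the kernel/microcode updates discussed above.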
See next GDB for update.
Bottom line is do what you can but "DON'T PANIC, MR MAINWARING! DON'T PANIC!"
Topics for GridPP
- EOSC-Hub, APEL, & GOCDB - what this new H2020 project is all about and what we have planned for these grid tools (Adrian Coveney). This was agreed.
- Security area - led by David Crooks. This is agreed.
- There was a really good storage workshop pre-HEPSYSMAN, so there will be 3, possibly 4, talks. This is agreed.
- CMS CPU efficiency - David Colling. This is agreed.
- Networking. Possibly
- Clouds (A. McNab/D. Colling to lead). Possibly.
- Technical implications of UKT0 (P. Clarke/A. McNab). This is agreed.
- Singularity needs for CMS - some workflows need SL7 and others SL6. Possibly.
- Change from SL6 to SL7. LHC experiment timescales may not match sites'; using containers is perhaps an option. Martin said that running SL6 containers (Docker) on SL7 showed a ~6% drop in performance, whereas running SL7 containers on SL7 showed almost none. Pete G. questioned whether this was simply a difference between running SL6 and SL7 themselves; Martin looked at some old figures which suggest that this is indeed possible, although it would account for only about half of the difference. Basically, experiments need to upgrade to SL7! This is agreed (for now).
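The SL6-on-SL7 container option above can be tried in one line. A hedged sketch: the `sl:6` image name is an assumption (sites may pull SL6 from a different registry path), and the script skips cleanly where Docker is unavailable:

```shell
#!/bin/sh
# Run a trivial command inside an SL6 container on an SL7 host to
# confirm the SL6 userland is visible to jobs. The image name sl:6
# is illustrative; substitute the site's own SL6 image.
if command -v docker >/dev/null 2>&1; then
    docker run --rm sl:6 cat /etc/redhat-release \
        || echo "could not run the sl:6 image (not pulled?)"
else
    echo "docker not installed; skipping container check"
fi
```

A site measuring the ~6% figure for itself would run the same benchmark binary once on the bare SL7 host and once inside the container and compare wall-clock times.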
General Catch up
Minor problem with certificates (expired) with VAC over Christmas. Fixed pretty quickly.
An interesting topic came up at the pre-HEPSYSMAN storage meeting. Since then, testing of the Spectre/Meltdown patches on ZFS has started; early indications are maybe a 10% impact, but this is ongoing.
AoB
Nothing.