September GDB – 14th September – CERN
Present at CERN (am)
Jeremy Coles (GridPP/UK); Matt Doidge (Lancaster); Alberto Di Meglio (CERN); Jhan Wei Huang (ASGC); Ikuo Ueda (ATLAS/Tokyo); Tim Bell (CERN); Maarten Litmaath (CERN); Tony Cass (CERN); Oxana Smirnova (NDGF); Pete Gronbech (Oxford); Dave Kelsey (RAL); Mark Mitchell (Glasgow); Ricardo Silva (CERN); Pacoma Fuente (CERN); David Collados (CERN); Sven Gabriel (NIKHEF); Jim Shank (ATLAS); Romain Wartel (CERN); John Gordon (STFC); Helge Meinhard (CERN); Gustave Aiftielliei (INDN); Luca Dell’Agnello; Claudio Grandi (INFN/CMS); Maria Alandes (CERN); Maite Barroso (CERN); Andrea Sciaba (CERN); James Adams (STFC/RAL); Alberto Masoni (ALICE/CERN); Milos Lokajicek (Prague); Pierre Girard (CC-IN2P3); Matthew Viljoen (STFC/RAL); Elena Korolkova (Sheffield); Florida Estrella (CERN); Andrew McNab (Manchester); Jan van Eldik (CERN); Gavin McCance (CERN); Marian Babik (CERN); Catalin Condurache (STFC/RAL); Gareth Smith (STFC/RAL); Ian Collier (STFC/RAL); Bernd Panzer-Steindel (CERN); Markus Schulz (CERN/IT); Manuel Guijarro (CERN/IT); Mattias Wadenstein (NDGF); Sam Skipsey (Glasgow); Graeme Stewart (CERN/ATLAS); Alessandra Forti (Manchester); Andrew Elwell (CERN); Shaun De Witt (RAL); Peter Love (Lancaster); Christopher Walker (UKI-LT2-QMUL); Michel Jouvin (LAL/CNRS)
Note: The minutes generally only record what was said in addition to the content of the presentation slides for each speaker.
Introduction (John Gordon)
IPv6 day was on 8th June. Heard little about it because it worked! There was an identity management workshop at CERN on 9th-10th June. The WLCG workshop took place at DESY on 11th-13th July. There was a suggestion for a regular newsletter and Ian put the first out before the summer. The next meeting is on 12th October. The second Wednesday of each month will be used for GDBs in 2012. Please could people check the proposed dates for 2012 and let John know of any clashes. It was suggested that Munich may host one of the meetings (the EGI community forum takes place there 19th-23rd March).
Chris Walker: QMUL is having problems with CREAM and is keeping LCG-CEs for safety. On your LCG-CE slide you mention removing availability calculations in October but this may not be ready. 20,000 jobs per day – reduced retention to 5-7 days. Maarten Litmaath: Are you following this up with the developers? Does Lyon have good experience with SGE? The site is patching CREAM and waiting for feedback from the VOs/experiments. Jobs need to be cleaned on the PanDA server. One of the problems in the UK was fixed using the IN2P3 fix. Also note that IN2P3 is not solely on SGE/CREAM (perhaps 80%). JG: Support for Grid Engine might be a future topic for this meeting.
Lorne Levinson – does this mean sites without CREAM would then have zero availability? JG: that would be the implication.
Dave Kelsey – recently circulated a new policy document to the GDB list looking at virtual machine images. This describes the responsibilities of the various parties. Invite feedback by 28th September. There is nothing in the policy forcing adoption. The idea is to get to a situation with common policies.
Technical Evolution Group (Markus Schulz)
General comments – mandate is uploaded to the agenda. Will be managed by the MB. Groups are divided into:
Data Management; Storage management; Workload management; Databases; Security model; Operations tools and services
JG: Some issues at shared sites such as sysadmins not having root access to the clusters and it would be good to have these represented. MS agreed.
Security Service Challenge 5 (Sven Gabriel)
GS: Useful exercise for ATLAS. Quickly identified who had sent in the malware and banned the users – the efficacy was shown in the chart. Helped with a real incident in CA a few months ago. On the PanDA issue – yes, everyone can look at what is going on, but that is also useful for people. There may be some privacy issues for ATLAS to think about, but before then we would like a statement on the legal obligations.
IM is useful but how secure is it?
SG: It was a closed channel and would not easily be viewed.
??: At one point you showed some mail templates. Where are they?
SG: I can provide you with a link. Many of these were in the site alarm email.
DK: In response to Graeme, the policy group are starting to look at this together with experts. I suspect the answer will be
Matt Doidge: From a site perspective this was very useful. It raised some interesting questions, such as the need for training in the latest techniques.
SG: This is what I hinted at with the per site training module.
RW: 90% of incidents are now multi-site.
gLite (Maria Alandes Pradillo)
gLite 3.2 fully supported until the end of October. Data Management services remain updated until April 2012.
gLite 3.1 LCG-CE and WMS fully supported until end of October.
Are there important issues that need fixing in gLite 3.2 CREAM?
DK: Your comment on the communication with the software vulnerability group. Is this now improving?
MAP: Yes they are aware of the issue. The advisory needs to relate to the upcoming/latest middleware.
EMI-1 (Doina Cristina Aiftimiei)
WLCG recommended versions (Markus Schulz)
Results of discussions at workshop are now on the twiki page for baselines. Maarten has informed the MB. With the WNs and UIs we should be careful and not rush. The recommendations can be read online. Need to look at where we are with the monitoring.
Stand-alone services (BDII, CREAM-CE etc.) can be moved first. With storage services some care has to be taken; moving them now is not recommended as it is too risky during data taking.
Are there any clear instructions on the transition? Have the repository updates been noted clearly?
MS: This is in the UMD release notes. The installation instructions come from EGI.
Future updates (Andrew Elwell)
AE: What is the solution at Lancaster?
MD (Lancaster): LSF vendor solution. For a group of nodes you cannot have a single node – distributing certificates is done by hand.
You say that puppet is the trendy solution but is it the recommended one?
AE: It depends on scale. With grid machines many machines are similar. At the moment I would suggest it is recommended.
LHCb Experiment Operations Report (Stefan Roiser)
Slide 9: Job runtime environment issue was related to contention on the local disk.
CMS (Claudio Grandi for Ian Fisk)
JG: Do you have plots showing how sites are doing against their pledges?
CG: No. But by the middle of next year CMS will be using all the pledged resources.
On slide 8 which sites are the lowest?
CG: I do not know from this graph but I can check and get back to you.
ML: Presumably you would open tickets against badly performing sites?
CG: Yes that is right – if the efficiency is systematically low it will be followed up.
MJ: About the myproxy issue was it at CERN?
MJ: What was the version?
ML: The latest version in EPEL. There were some patches applied. The release was just completed last week. You need the patch only if you are using wildcards in retrieval or renewal policies.
CG: For CMS the PhEDEx server is here at CERN so it is not using an external myproxy server. CRAB may also use myproxy servers outside CERN, in which case you need either the old version of myproxy or the new one with the patch. CRAB does the submission with the -r option. If the submission is via the WMS then -r is not used.
ATLAS (I Ueda)
JG: On slide 4 does the plot show that you have moved half the distribution to T2s?
IU: The policies this year are different but the plan has restarted.
Pierre (Lyon): Slide 8. About the WN issue. We know the problem is with AFS and now we have CVMFS and the efficiency is much better. About the DCCP we are waiting for a developer response.
On StoRM: Luca (CNAF): On the last point about the strange behaviour getting worse with a newer version… unfortunately there are still some serious problems on the disk used for ATLAS. We had three serious stops of the storage. In yesterday’s ticket some explanations were given for the slow transfer rates.
IU: I looked into all these tickets (slide 9) to see if these were SE issues.
LdA: After some days we realized there was a memory leak and this has been fixed in mid-August patches. Then started the other problems. QMUL improved a lot after the upgrade. This explains the inconsistencies.
CW: From the QMUL point of view: the previous version was crashing several times a day. ATLAS pilot jobs can rapidly fail and that was one driver for updating. We were also getting 100GB of log files per day. I filed a ticket on this before the upgrade and ATLAS did little about it. In the process of upgrading things broke noticeably, whereas the crashes before were more hidden. We are now running better.
LdA: What we discovered – we suffered directly as T1 because certification process was not complete. The developer teams need feedback from more stressful early adoption tests. It was clearly an error to install the release too early. All the critical bugs were found after the deployment to production (2 memory leaks and crashing front-end).
Mario David: There was a series of events. People upgraded – memory leaks were in the front end and back end. In staged rollout the memory leak was not discovered even with high load from CMS. Only when QMUL upgraded and the ATLAS analysis workflow changed did the usage expose the problem.
JT: Slide 18. NIKHEF We used to have the authorized alarm lists. I thought the upgrade goes to the normal email address not the alarm address.
ML: Slide 13. When this was discussed we mitigated the issue you mention (drawing down proxies) using keys. So if you were to go the whole way with the implementation you would be able to do better than you present here. Some of the other points are valid.
JG: You say ….
Simone Campana: One of the points raised by SSC5. It was shown earlier there was a tail. You need to stop the pilot factory and you have a tail of jobs. Both panda server and pilot factory can be implemented to accept ARGUS ban. ATLAS would then use ARGUS to understand the ban. The response time using this service would be much quicker.
JG: You mean worldwide blacklisting?
JG: The question of whether forensics can be complete enough is left to the TEG. In the UK there are issues with jobs killing WNs etc. and relying on email and forensics to trace through it will take much longer than if identity changes take place.
JT: This glexec document that came through. There are some factual inaccuracies about glexec in that document that need to be corrected.
Lyon: With glexec we have complete control over who can do what on our site. The ATLAS technique does not allow this. For example my site may ban a certain user.
ALICE (Latchezar Betev)
Meeting closed at 16:10
[09:02:29] CERN 31-3-004 Sorry we are late. Couldn't get into the room until 1000
[09:06:08] Jeff Templon no audio
[09:06:10] Alberto Aimar is audio ok for the others?
[09:06:24] Ioannis Papadopoulos NO, it is muted at CERN
[09:06:26] Alberto Aimar no audio
[09:06:29] Andreas Heiss i cant hear anything
[09:06:48] Ioannis Papadopoulos Can CERN unmute please?
[09:08:04] Ioannis Papadopoulos video is ok, but still no audio
[09:08:29] Alberto Aimar no audio from CERN
[09:08:51] Jeremy Coles John is just trying to figure out what is muted.
[09:08:54] Jeff Templon ok
[09:08:54] Ioannis Papadopoulos Could you restart the audio at CERN?
[09:09:44] Ioannis Papadopoulos Now is ok
[09:09:45] Jeff Templon yep
[09:09:54] Andreas Heiss
[09:09:59] Hélène Cordier thanks
[09:31:47] CERN 31-3-004 Can you see the slides
[09:34:57] Alberto Aimar i see the room but not the second cam on the slides
[10:09:00] Jeremy Coles The slides and movies are on the agenda page
[10:09:57] Jeremy Coles http://indico.cern.ch/materialDisplay.py?contribId=2&sessionId=1&materialId=slides&confId=106648
[10:11:49] Gonzalo Merino I only see 2 movies there. No slides.
[10:30:02] Jeff Templon turn off your mike, islamabad
[10:30:17] Mario David imediately please
[10:30:30] Jeff Templon great
[10:31:23] Jeff Templon according to evo it's islamabad
[10:31:32] Jeff Templon they have a microphone open
[10:32:38] Jeremy Coles I'll check on the slides.
[11:07:15] Mario David http://repository.egi.eu/
[11:28:00] CERN 31-3-004 STARTING AGAIN AT 1400 cet
[13:03:03] Philippe Charpentier Yes
[13:07:32] Andreas Heiss cvmfs will be setup at GridKa. Will take some time, though.
[13:08:14] Philippe Charpentier Thanks Andreas, keep Alexey updated on schedule please!