GDB minutes – 10th February 2010
(These notes record the discussion during the meeting and do not summarise the presentations – Jeremy Coles).
Introduction (John Gordon)
Q: Have the experiments got enough SL5 WNs?
GS: ATLAS is moving to SL5. SL4 sites will start to see job load squeezed.
JG: Nobody for CMS or LHCb to comment.
Installed capacity: please look at the sanity checks run by Nagios on all sites.
EGI council met in Amsterdam on 3rd February. The statutes were agreed and the Foundation created.
LHC startup agreed at Chamonix workshop. 2010 and 2011 plan to run from mid-February to end of November.
IB: The message is that 2010 and 2011 will have a full schedule with only short technical stops.
JG: Small 6-week window for technical changes.
EGEE Service baseline list (Nick Thackray)
The client and service version information are at these URLs:
https://twiki.cern.ch/twiki/bin/view/EGEE/SupportedClientVersions
https://twiki.cern.ch/twiki/bin/view/EGEE/SupportedServiceVersions
JT: What does support mean here?
NT: If support is requested, say in a GGUS ticket, then the first check before responding is whether you are on a supported version.
JG: If you find a bug and fix it in the most recent version, do you backport the fix to others?
NT: Yes, if it is a supported version. There will be phases: fully supported; security fixes but no bug fixing; and then no fixing, at which point support is dropped.
MJ: Could you just support the last two versions?
JG: More relevant to the SA3 side of things – once a version has been shown to be consistent for 2 months, notice is given of end-of-life for the older component.
NT: One thing being considered is that releases are not at regular intervals, so putting forward a time window was easier to handle.
SB: What do you mean by a version here? Some things are minor while others require major interventions. It is not always possible to upgrade.
NT: In the detail we do distinguish types and they are treated differently, particularly the major upgrades. One issue is that a lot of support gets used up for old release problems.
SB: You can say it is not supported but that does not mean sites will upgrade.
MS: The idea – if someone comes for support your first question is are they running a supported version? If not the ticket is closed noting this situation.
The lifetimes of services on the webpage are currently very generous and they will become stricter.
JT: So long as things are backward compatible then 6 months is ok. If not then 6 months is a challenge.
JG: Discussed with experiments. If a change means that the experiment does not want to move then the operations meeting will agree a timeline.
Changes to versions and lifetimes are discussed by the TMB. Sites are given 4-6 weeks to upgrade. We cannot force sites to upgrade except in the event of a security patch/vulnerability.
Q: what is the relationship between the lifetime and the baseline for WLCG?
NT: It is possible the WLCG baseline has older components. If there is a high-priority update then there is a 4-week window.
JC: Your last statement indicates that sites will be suspended if they do not upgrade but earlier you said that sites would not be forced to upgrade.
NT: Basically if a site can provide a good reason for not moving then they can stay on the version but will not be supported – for example if they need a given release to support a VO.
JT: The reason we are doing this is that there are a small number of sites causing problems. All sites subject to this policy because of the minority of sites. Better make sure that some management later on does not pick up on your first statement to force sites.
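The support phases NT describes and the ticket check MS outlines in this item can be sketched as below. This is only an illustration under stated assumptions: the phase names, the SUPPORT_PHASE table, the component name, the version strings and the triage_ticket function are all hypothetical and are not taken from the EGEE supported-versions pages.

```python
from enum import Enum


class Phase(Enum):
    FULL = "fully supported"
    SECURITY_ONLY = "security fixes only"
    UNSUPPORTED = "support dropped"


# Hypothetical example data: (component, version) -> current support phase.
SUPPORT_PHASE = {
    ("example-service", "3.2"): Phase.FULL,
    ("example-service", "3.1"): Phase.SECURITY_ONLY,
    ("example-service", "3.0"): Phase.UNSUPPORTED,
}


def triage_ticket(component, version, is_security_issue=False):
    """Return the first-line decision for a support ticket.

    Unknown versions are treated as unsupported (an assumption).
    """
    phase = SUPPORT_PHASE.get((component, version), Phase.UNSUPPORTED)
    if phase is Phase.UNSUPPORTED:
        # First check: is the site running a supported version?
        return "close: version not supported"
    if phase is Phase.SECURITY_ONLY and not is_security_issue:
        # Security fixes continue, ordinary bug fixes do not.
        return "no bug fix: advise upgrade"
    return "handle"


if __name__ == "__main__":
    print(triage_ticket("example-service", "3.1"))
```

A site with an agreed reason for staying on an old release (for example to support a VO) would still fall in the unsupported branch, which matches NT's statement that such sites can stay but will not be supported.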
ARGUS - Technical Update (Christoph Witzig)
JT: The product team that you mention – will that continue in the EGI era?
CW: Yes. All institutes have signed up and work allocations agreed in meeting at the end of last year.
glexec/ARGUS pilot service (Antonio Retico)
A presentation on short-term plans.
JT: You said the VOs will verify their frameworks with glexec/ARGUS
AR: No, glexec only at this time.
LdA: INFN/CNAF have prepared some nodes for experiments to test and will deploy widely after that.
FH: This requirement for new VO boxes for ALICE, will this also apply to other production boxes? Do they need a new software area?
ML: They are afraid of messing up production while trying these developments, so production should use the current version of ALIEN, but where a site has resources for testing they ask for a new instance. This allows them to play with the installation; afterwards it gets integrated into the mainstream of ALIEN and becomes the next production version.
FH: ATLAS requested separate sw area when moving to SL5 and maintaining this in parallel is an overhead – please bear this in mind if requesting something similar for bringing in ARGUS.
Experiment status: ALICE testing, CMS just getting started, ATLAS will get more involved with scale tests later on.
FH: What is next? Pilot service will provide good results and then what? Is it planned to deploy ARGUS instead of SCAS?
LdA: For CNAF, if ARGUS passes certification then we plan to not adopt SCAS but move directly to ARGUS.
JG: Do you have a timescale for that?
LdA: Depends on experiment testing timescales. By the end of February we hope to have some clear indication.
FH: Sites testing operability but not the scalability.
JG: Did we repeat the tests with secure mode?
CW: Longest tests – 6 different hosts with 10 clients – gave performance and stability indicators. Performance will be retested once 1.1 is released.
ML: Don’t remember numbers but even with authentication it seems there is no major problem to fear.
JG: Mentioned new version of glexec required for 1.1 testing?
CW: You need the latest version of lcmaps plug-in that is called by glexec. The way ARGUS is called from glexec – uses lcmaps and calls up module… need the plug-in to send the information… the right lcmaps plug-in must be called. If you have glexec on WN then it is a matter of configuration. Just a matter of rpm. Certified for v1.0 but v1.1 needs retesting.
[10:28:22] Gianni Pucciani the lcmaps rpm I am using to certify Argus 1.1 is not yet certified, I take it from Etics.
Middleware update (Maria Alandes Pradillo)
JG: Staged rollout middleware is almost ready for production.
ML: It is not a free-for-all; a site is committing to potentially having to downgrade its production instance… so they may have extra resources to fall back on.
SB: How many sites are involved with the staged-rollout?
ML: About 10
MAP: But not all services are covered
AR: There are 15 sites of which 5 are main production ones.
Refer to https://twiki.cern.ch/twiki/bin/view/EGEE/SL5Planning for the SL5 release plans.
AR: You said a new version of YAIM for CREAM. Does this fix the problem of services being started in the wrong order?
MAP: I’m not sure.
SB: Will this continue through the transition to EGI?
OK: We are doing everything so the answer to that question is yes.
Pakiti – The patching status monitoring (Romain Wartel)
JT: This service is a hacker's dream. It is the one place you go to find out how systems are patched.
RW: There is a central service but you can run it in a different configuration at your site. The tool is access controlled using certificates and Apache.
JT: It is very useful but
??: Where does the front-end get the CVTs?
RW: OVAL is the one we use.
RW presented a video on security exposures.
EGI without ROSCOE (Jamie Shiers)
There were several communities involved with the bid. Some people are discussing whether a streamlined proposal can be submitted for a later call. We had hoped to get funding for the people concerned (3 CERN + 2 INFN) for May. Without these projects it is effectively 1 FTE per experiment. So if this is what we are really missing then going through EU routes is perhaps not the most efficient. There is a need for this support and CERN is investigating whether this can be funded in other ways. The other partners may be doing something similar. Next steps: within HEP, partners discuss the feedback and evaluator scores.
INSPIRE – the level of support for these was 60 months for the dashboard, 204 for LHC VO services and 60 for Ganga. There would have been a 3-year profile. So for two areas this is about 1 FTE for dashboard and Ganga. So in an optimistic viewpoint we might still get 1:1 and 3.? funded people. We now need to understand what money will come from Brussels and also what top-ups may be available.
Steven Newhouse: On the timescales. We should get by the end of February the feedback from the panel. Negotiations start at the end of March and last 3-4 weeks for the INSPIRE funding. EGI-Inspire was driven to 4 years.
JG: Can see how someone might cynically think HEP might be able to cover this but for all SSCs to fail might raise questions in several countries.
Distributed Database Workshop Summary & Tier-1 Service Coordination (Maria Girone)
JG: You mentioned earlier about baseline services
JT: I understand why your group is looking at many of the services you talked about in the mandate slide – including other services (workload management services, WLCG baseline, security)….. so this adds another meeting (i.e. the “Tier-1 service coordination meeting”) where people need to be present in case a decision is made that impacts you. There are many meetings…
JS: This was agreed at the last MB and reduces service meeting time by a factor of 3. See the “streamlining” slide.
JT: If you were there!
FH: I agree that this is an extra burden.
MB: Meetings have different scopes. Daily meetings are for daily problems. An issue for sites is that different people need to attend different parts of the meeting.
MG: FTS2.2.3 should be discussed somewhere…. Can work on the agenda to optimise
JG: This is essentially for a Tier-1 audience. The GDB is more for all sites – i.e. wider distribution. If you come to the GDB and have the same discussions then the meetings have failed.
GM: I also see the problem at my site – different people (DBAs for example) need to join for different parts of the meeting.
JG: So far there has only been one meeting so perhaps too early to conclude how useful it has been.
HEPiX virtualisation working group (Tony Cass)
GS: Do you have contact with the experiments?
TC: Nobody specifically at the meetings to comment as an experiment though people from the experiments may be there.
Massimo: ATLAS is starting working groups for T3s and one of these relates to this area; it would be useful to synchronise the discussions.
TC: I understand from Ian that there is some plan to link up in this area.
M: Many agree on this but we need to think how to do it. At the GDB discussion last time there was no common site/experiment view on what was wanted so this working group is looking at the issues. A real clear view of where everyone wants to go in this area would be useful.
AF: If the main aim of running these images is for the experiments then they should be involved.
TC: Compared to data moving around … but the number of times the images change is more important. If there are 250 images in current use then you need to have storage for all of these. Image transmission is about shipping/storing the differences – so it is related to image management.
JT: Are you worried about things that might torpedo this area? For example sites not wanting to download certain payloads?
JG: Tony is trying to come up with some proposals to see if sites are going to be able to go along with this
TC: Putting the comments together then perhaps we need to have another discussion in this forum about common views on how these might be used.
Running a reliable site (John Gordon)
Asked for feedback before the meeting but did not get much. The matter to discuss is whether there can be useful discussion on common problem areas (so that sites learn from each other rather than have to learn the hard way).
Oracle:
3D etc. Have a working group already where there is sharing already.
Installation & Configuration:
Quattor is quite useful for those that use it. Are there other places, as for cfengine, where discussion can happen?
Storage Systems (dCache, CASTOR, DPM):
Maria’s meeting – often sharing of problems not always solutions! dCache have a meeting with Tier-1s for example. CASTOR have regular phone meetings. DPM has a mailing list – but there is scope for storage workshops.
Power & Cooling
Reported at HEPiX but not so much sharing of designs.
Tapes
Vlado made a request around HEPiX
Remote operations
Tony mentioned a new machine room with remote operations. In the UK we did work prompted by the swine flu outbreak, so we have had some experience of running the T1 from home.
Benchmarking
There is a HEPiX group that worked on this area
Storage benchmarking
For example disk testing.
WAN
LHCOPN: there is a group.
LAN
Not much discussion on local networking. How are sites configured?
JT: Some of these things are discussed at the T2/experiment workshops.
JG: But the possibility for fragmentation between experiment preferences if this happens in isolation.
CPU
Mario David: What we have seen recently (as a T2): due to the number of software versions the experiments have, they experiment with several and then we end up with failed jobs and we do not know why. Experiments later say this is not a problem, but the site spends time investigating.
Multi-user pilot jobs update (M Litmaath)
Working group now has 68 members. Questionnaire sent out on Jan 11-13. Most responses now received.
Results in the slide are colour coded: red means no; amber means no, but conditional; blue is a requirement; and cyan implies a dependency.
GS: Using pilot jobs ATLAS can prioritise for users/groups etc. Special pilots used for their groups.
MD: As of now, we can discriminate based on FQAN. …
ML: Looks like CMS would have to look at this area for their pilot system.
Caltech asked about a job ClassAd attribute being updated with new user information so that they can monitor changes in Condor.
MS: If you are on an AFS system then what token do you use?
SB: You are saying there are multiple – if we said setuid was required for example how many sites would be excluded? May just lose a few sites to analysis.
There will be some
Massimo: If a site says MUPJ is not accepted then, depending on the numbers, we have to consider what was covered by the MoU.
JG: There is an option to use single user pilots or just submit production jobs to those sites. If 80% of sites had said they would not support this then we would have had to look at things like the MoUs but the responses do not suggest this measure is really needed. We do not want to push policies for the sake of it.
JT: You mentioned documents in the questionnaire. But I notice ALICE information is still missing.
ML: ALICE are working on moving ALIEN to use glexec and I’ve advised them to do this in stages. Maybe they can report at the next meeting.
JG: One other issue in this area… getting sites configured according to LHCb was difficult. We said that we would have a SAM test to help. I’ve been told that we can now expect a test within about a week and there will be a trial against sites that already have glexec installed.
CREAM deployment news (Antonio Retico)
GS: We have found a show stopper with the 24hr limit on jobs. It is a condor problem – job lease feature that CREAM has… it causes the job to fail.
JG: For testing it should be an OR for LCG-CE OR CREAM-CE until an experiment only wants CREAM and then their experiment SAM should require CREAM.
The interface to GRIDVIEW
MS: With experiments using it more intensively we see more issues. We need more sites to move otherwise we’ll not see the problems on a very large scale. We do need to get more sites to take it up – helps up the priority on the certification too.
JG: Is 1.6 just bug fixes?
AR: Yes.
JG: You do not want a single release to have too many bug fixes.
Doug: Deployed into production at Glasgow. It was only when ALICE use started to pick up that we started to see problems… and then an issue with Condor submission through Condor-G. There were issues with max startups defaulting to 10, but that was easily fixed through the configuration. With these two problems in hand things are OK. Work is also arriving from the WMS – now taking LHCb, CMS and ATLAS prod work. Would recommend taking the pain to set up now as the upgrades are straightforward.
Massimo??: Used to document all the known issues in a wiki page. But I have to be told if a workaround is enough or new patches are needed to fix things – takes 2-3 months for patches to get through the system.
JV: The way CREAM handles the output – in a WMS-like way. Put in a sandbox. In jdl need to look at the …. One side effect is that if the server is not available the output stays on the WN for some hours and then makes the node inefficient.
SB: On policy side, I thought we had already agreed that sites should deploy by the end of March.
JG: Worth having the discussion if there are many new bugs. What I’m hearing today is that there are no reasons not to continue with the previously agreed rollout policy.
MJ: Reason did not do it before – SL5 upgrades, security updates etc. It was more work and for a T2 you generally just run one CE and you are introducing new problems – sharing gridmapdir and publishing. They are new things.
AF: For me the main issue is installing something that requires babysitting.
Chris Walker: What else can be dropped from the schedule if there are manpower issues? Now it sounds like the service is more stable I think more sites will deploy.
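JG's suggestion earlier in this item for how testing should treat the two CE types can be written as a small predicate. This is a minimal sketch, not how SAM is implemented; the function and parameter names are hypothetical.

```python
def ce_test_passes(lcg_ce_ok, cream_ce_ok, experiment_requires_cream=False):
    """Decide whether a site passes the CE test.

    By default either CE type passing is sufficient (LCG-CE OR CREAM-CE).
    Once an experiment only wants CREAM, its own test requires CREAM.
    """
    if experiment_requires_cream:
        return cream_ce_ok
    return lcg_ce_ok or cream_ce_ok


if __name__ == "__main__":
    # A site with only a working LCG-CE passes the default test...
    print(ce_test_passes(lcg_ce_ok=True, cream_ce_ok=False))
    # ...but not the test of an experiment that only wants CREAM.
    print(ce_test_passes(True, False, experiment_requires_cream=True))
```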
OSG Update (Ruth Pordes)
JG: Will follow up on some of the issues such as the installed capacity question.
The EGEE-EGI Grid Operation Transition (Maite Barroso)
Summary of discussion at the EGEE all-activities meeting last week.
SB: Question about technical migration. How will you change these tools over to regional instances? For UKI for example do we carry on calling it UKI and split some time afterwards?
JG: Country already exists in GOCDB and CESGA accounting
SB: Within GGUS – do you get a mixture of NGIs and ROCs?
JG: Need ROCs to exist until the end of EGEE!
SB: So long as there is no point when you are unable to submit a ticket to a site because the forwarding is not working.
Meeting closed: 17:30
Present at CERN
Jeremy Coles – UK
Maarten Litmaath – CERN
Roberto Santinelli – CERN
I. Ueda – JP/Tokyo
Hiroyuki Maisunaga – JP/Tokyo
Bryan Caron – TRIUMF/Alberta
Gianfranco Sciacca – UCL, UK
Christoph Witzig – SWITCH
Fabio Hernandez – CC-IN2P3
Ron Trumpert – SARA
Luca Dell’Agnello – INFN
Laura Perini – INFN-Milano
M Jouvin – GRIF/CNRS
Graeme Stewart – Glasgow
Alessandra Forti – Manchester
Jeff Templon – NIKHEF
Gonzalo Merino – PIC
Alberto Masoni – ALICE/INFN
Frederique Chollet – IN2P3
Andrea Sciaba – CERN
Antonio Retico – CERN
Helene Cordier – CC-IN2P3/FR
Maurice Bouwhuis – SARA/NL
Harry Renshall – CERN
Maite Barroso – CERN
James Casey – CERN
Gareth Smith – RAL
Ian Bird – CERN
G. Vestergombi – Budapest
M Schulz – CERN
David Collados – CERN
G. Carlin – INFN
Andreas Heiss – KIT
Holger Martin – KIT
Stephen Burke – RAL
Douglas McNab – Glasgow
John White – UHHIP (JRA1)
R Wartel – CERN
John Gordon – STFC/RAL
On EVO:
Steven Newhouse
Denise Heagerty
Jukka Klem
Ionel Stan
Alvaro Fernandez
Massimo Sgaravatto
Richard Gokieli
Mike Kenyon
Helge Meinhard
Patrick Fuhrmann
Ruth Pordes
Oxana Smirnova
Gianni Pucciani
Christoph Grab
Mihnea Dulea
Tiziana Ferrari
Pete Gronbech
Mario David
EVO chat:
[09:14:37] Gianni Pucciani joined
[09:03:01] CERN 31-3-004 can anyone hear CERN?
[09:14:39] Alvaro Fernandez joined
[09:14:40] Massimo Sgaravatto joined
[09:14:42] Richard Gokieli joined
[09:14:44] CERN 31-3-004 joined
[09:16:46] Mike Kenyon joined
[09:18:38] Helge Meinhard joined
[09:21:11] Patrick Fuhrmann joined
[09:21:11] Patrick Fuhrmann left
[09:21:40] Ruth POrdes joined
[09:23:04] Oxana Smirnova joined
[09:26:49] Christoph Grab joined
[09:35:05] Massimo Sgaravatto left
[09:40:19] Tiziana Ferrari joined
[09:43:06] MIhnea Dulea joined
[09:45:59] Ruth POrdes left
[10:15:10] Tiziana Ferrari left
[10:27:11] Jukka Klem left
[10:28:22] Gianni Pucciani the lcmaps rpm I am using to certify Argus 1.1 is not yet certified, I take it from Etics.
[10:43:17] Tiziana Ferrari joined
[10:46:29] luciano gaido joined
[10:55:40] Jeremy Coles The meeting will restart at 14:00.
[10:57:26] Christoph Grab left
[12:40:32] Mario David joined
[12:40:47] Dietmar Kuhn joined
[12:40:55] Pete Gronbech joined
[12:41:01] Torsten Antoni joined
[12:46:56] Ruth POrdes joined
[12:57:09] Torsten Antoni left
[12:58:01] Torsten Antoni joined
[13:03:42] Steven Newhouse Now we have no sound again
[13:03:52] Mike Kenyon It's gone rather quiet...
[13:03:53] Oxana Smirnova something's killed again
[13:04:10] Mike Kenyon
[13:04:28] Steven Newhouse So there really is no ROSCOE
[13:04:30] MIhnea Dulea Sound, please
[13:04:47] Steven Newhouse Sound is back
[13:10:09] Ruth POrdes left
[13:10:14] Ruth POrdes joined
[13:11:45] Patrick Fuhrmann left
[13:11:52] Patrick Fuhrmann joined
[13:14:16] luciano gaido left
[13:14:25] luciano gaido joined
[13:32:57] Dennis van Dok joined
[13:37:33] Phone Bridge joined
[13:38:50] Massimo Sgaravatto joined
[13:53:56] Ionel STAN left
[13:54:05] Ionel STAN joined
[14:03:20] Helge Meinhard left
[14:03:31] Helge Meinhard joined
[14:14:28] Christoph Grab joined
[14:15:29] Ruth POrdes left
[14:15:34] Ruth POrdes joined
[14:19:10] Jeremy Coles Coffee for 10 mins.
[14:44:12] Helge Meinhard left
[14:45:12] Helge Meinhard joined
[14:56:47] Helge Meinhard left
[15:00:17] Steven Newhouse left
[15:02:49] Christoph Grab left
[15:07:36] Torsten Antoni we have no sound...
[15:07:52] Jeremy Coles There is sound for me,
[15:09:22] Torsten Antoni audio is chopped up.
[15:09:51] Jeremy Coles Can anyone else confirm this as an issue for them?
[15:14:19] Ruth POrdes it is for me
[15:17:18] Ionel STAN left
[15:17:21] Jeremy Coles Is this for all speakers or just Maarten?
[15:18:29] Ionel STAN joined
[15:18:40] Ionel STAN left
[15:18:58] Ionel STAN joined
[15:20:23] RECORDING Ionel joined
[15:25:25] Ionel STAN left
[15:37:49] Ruth POrdes the audio is better now - yes.
[15:38:41] Ruth POrdes When I start talking could somoene please type here the quality and problems .. immediately would help ! I am using IP rather than phone today.
[15:38:43] Ruth POrdes thank you!
[15:39:15] Jeremy Coles Wiil do.
[15:52:30] Torsten Antoni left
[15:55:17] Ionel STAN joined
[15:58:58] Ionel STAN left
[15:59:05] Ionel STAN joined
[15:59:16] luciano gaido left
[15:59:26] Ionel STAN left
[15:59:50] Ruth POrdes I just spoke
[15:59:58] Jeremy Coles You are clear now.
[16:00:37] Ionel STAN joined
[16:04:18] Dennis van Dok left
[16:06:15] Gianni Pucciani left
[16:08:10] Ionel STAN left