GDB

Europe/Zurich
IT Auditorium (CERN)

John Gordon (STFC-RAL)
Description
WLCG Grid Deployment Board monthly meeting
GDB Minutes – 9th November 2011
[These minutes have not yet been reviewed and validated by the chair]
Present at CERN (apologies if spellings are incorrect based on list deciphering):
Jeremy Coles – UK/GridPP (minutes)
Jhen Wei Huang – Taiwan/ASGC
Shu-Ting Liao – Taiwan/ASGC
Urvashi Karnani – IT-GT
Marian Babik – IT-GT
Christopher Jung – Germany/KIT
Simon Lin – Taiwan/ASGC
David Smith – IT-GT/CERN
Helge Meinhard – CERN IT-PES
Ulrich Schwickerath – CERN IT-PES
Manuel Guijarro – CERN IT-PES
Elisa Lanciotti – CERN IT-ES
Michel Jouvin – GRIF
Alessandro Digibol??? – CERN IT-ES
Jerome Belleman – CERN IT/ES
I Ueda – ATLAS/Tokyo
Andrea Sciaba – CERN/IT
Jeff Templon – Nikhef
Dirk Duellmann – CERN
Belinizo T? – CERN
Ian Bird – CERN
Antonio Peree – CERN
Bernd Panzer-Steindel – CERN IT
Maite Bauoso – CERN IT
Markus Schulz – CERN IT
Gavin McCance – CERN
Eric Lanson – ATLAS
F Chollet – IN2P3
Pierre Girard – CCIN2P3
Yannick Patok – IN2P3
David Collados – CERN
Milos Lokajicek – Prague-FZU
Luca Dell’Agnello – INFN
G Carcino – INFN
John Gordon – STFC-RAL UK (Chair)
Maarten Litmaath – CERN
 
Introduction (John Gordon)
Since the previous meeting – EMI AHM (17th-19th Oct) and HEPiX fall meeting (24th-28th Oct).
Next meeting is on 14th December. That meeting is in building 40. For next year there is a meeting on 11th January starting at 14:00 – devoted to TEG. There is a pre-GDB on 7th Feb for the TEG reports. There will be a GDB on 8th Feb. Trying to co-locate 14th March meeting with EGI Community Forum in Munich, moving to the following week clashes with the LHCC meeting. Following feedback today will aim at 21st.  Other dates 18th April, May 9th and June 13th.
SuperComputing 11 is coming 14th-17th November. An LHCONE meeting 1st-2nd December. Do we want to make a statement? Perhaps after that meeting.
There is an EGI Information Services meeting 1st December.
Ian Bird: Before we go to this workshop it is very important that we have a clear statement on what we need from information services in the future. We must understand what we still need.  We have already done work to improve the current system.
Markus: EGI needs to get involved because, while we have tested the config advancements, the most important thing is not the BDII technology but the quality of the information, and that is an operational task. EGI is an operator of the infrastructure.
John: EGI has been following up on staleness but there are more tests that can be developed.
Jeff: Staleness is not addressing Markus’s point.
John: Only the person at the site will know the correct setting. If EGI are doing it then it is better to let them do it. For a long time we have said our deadlines are now. But, with TEG we are looking several years ahead and EGI would be addressing on the 1 year timescale. Will ask Lawrence to give a report back.
Ian: Should make it clear we will not go with new requirements, just a request for published information to be cleaned up.
 ISGC 26th Feb – 2nd March. OGF34 11th-14th March. EGI User Forum 26th-30th March. WLCG workshop 19th-20th May. CHEP 2012 21st-25th May.
Ian: What do you see as useful in OGF now?
John: I’m involved in accounting – there are new standards coming. There is a thread on that. Glue2 as a long-term solution is being pursued and cloud interfacing… will be of interest to us. It is difficult to follow more than one track.
GDB Issues: Several passed to TEG – MU pilot jobs and SCAS/ARGUS…
Sites are no longer required to run LCG-CEs. Proposed removing LCG-CE from availability calculation but CREAM for SGE makes this unreliable. Will now aim at the end of the year for the change.
Disk pledges – Floods in Thailand are seriously affecting disk availability and price.
 
John: Was approached to use the GDB as a guinea pig for Vidyo.
Ian: I have volunteered the GDB to test once the procurement is sorted and service in place. Probably early next year.
Action: John to email people and request that GDB members get signed up for Vidyo.
 
Technical Evolution Groups
Workload Management (Davide Salomoni)
Suggested scope includes many areas supposed to be covered by this TEG.
Common to all TEGs – initial assessment and strategy document.
33 people subscribed to the mailing list. Some Tier-1s not represented.
Held face-to-face meeting on 3rd and 4th November at CERN. Will hold biweekly phone meetings and then another face-to-face in January.
John: You may have heard the discussion about the 1st Dec workshop on information systems.
Davide: Yes we have that as a high priority and hope to have more information by then. I will reiterate – we feel that site representation can be better and invite others to join.
John: Perhaps incumbent on the countries to ensure coverage at these groups.
Ian: This was particularly commented on by security and operations groups. Both areas are important for sites and it is important that they do join as this can not work without their participation.
Michel Jouvin: Perhaps sending a reminder to the sites would help. It was not completely clear how representation would be decided.
JT: I sent the message to LCG-ROLLOUT as I thought it may get lost if not sent directly. The one thing I hear is that there are not a lot of people at the sites and they are already very busy… one objective of our TEG is to reduce the workload on site admins!
MJ: We do not always know which TEGs are most useful to join…
JT: We should be looking at the type of sites not just country representation.
RW: In security the only country represented is Sweden!
 
Data Management (Dirk Duellmann)
Group has moved from ‘disorganised’ to ‘existing’. We think we have the stakeholder representation we need. Can work with the storage TEG to get more site coverage. One phone conference so far. Want statement upfront – what the document will be used for and a confirmation that the strategy will be implemented! This may be expressing a worry from previous discussions. Plan to start with the experiment points of view. Question about how to organise overlap areas – MB suggested that TEG chairs work closely.
JT: Missed the content delivery networks.
DD: Caching is something I would like to put in, but there are so many caches that people get confused very quickly without detailed explanation. I did not want to create complexity too early.
Markus: Real access control needs – sites and users. The current system is used to a very small percentage of its capability. Need to describe the use cases.
DD: VOMS is probably one of the special needs. Would hope security come up with minimal requirements to run the system.
IB: Another area is the protocols
DD: They got an enormous space even before we do the actual federation.
IB: If we are talking about sustainability etc then need to discuss http for example. The more I hear the more I think a common meeting with the storage group is needed.
DD: Need to identify which topics are overlap areas and which are completely separate.
MS: Was there anything about bulk file transfers? This is clearly DM. Also the use cases – the individuals now outside of the grid sites, I assume that is hard now.
DD: The interface to storage pools etc will be an overlap.
JG: Local, wide and direct access contention…
DD: It is important to have the right set of questions… do not yet have the answers
JG: The experiments have got the top of the stack….
DD: This is an important area to have in the discussion
 
Operational Tools (Jeff Templon)
This is a large area of work. Should the BDII be in here?
IB: Question on information services was in the workload management group. But we do need a list of operational services on which we rely.
JT: If talking about the ldap implementation then it is in scope…
JG: See BDII as a service as distinct from what is in it.
Aim to have weekly phone conference on Mondays. F2F at CERN on 28th November. 12th December workshop on Future Strategy.
 
JG: Site has got local information on the accounting so it can be compared. Compare local database with remote. One issue that is present is the benchmarking and we cannot probe remotely.
90% of sites have correct reporting but 10% perhaps are not correct. Perhaps this needs to be incorporated into the bullet about “accuracy of the accounting information unknown”.
Alessandra Forti: what there isn’t is a comparison between batch system and local database.
JG: I’ll come back to this later …  the installed capacity document is the WLCG requirement on storage accounting.
IB: And the RRB complain about this situation.
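The comparison Alessandra describes, between what the batch system recorded and what reached the local accounting database, could be sketched roughly as follows. This is an illustration only: the record layout (CE name plus wall-clock seconds per job), the function names and the 1% tolerance are assumptions and are not part of APEL or any existing site tool.

```python
from collections import defaultdict


def summarise(records):
    """Sum job count and wall-clock seconds per CE.

    Each record is a hypothetical (ce_name, wall_seconds) tuple, one per job.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for ce, wall in records:
        totals[ce][0] += 1
        totals[ce][1] += wall
    return {ce: (n, w) for ce, (n, w) in totals.items()}


def compare_accounting(batch_records, db_records, tolerance=0.01):
    """Return the CEs whose batch-system and accounting-database totals disagree.

    A CE is flagged if the job counts differ (including a CE missing on one
    side) or if the wall time differs by more than `tolerance` (relative).
    """
    batch = summarise(batch_records)
    db = summarise(db_records)
    mismatches = {}
    for ce in sorted(set(batch) | set(db)):
        b = batch.get(ce, (0, 0.0))
        d = db.get(ce, (0, 0.0))
        ref = max(b[1], d[1])
        wall_diff = abs(b[1] - d[1]) / ref if ref > 0 else 0.0
        if b[0] != d[0] or wall_diff > tolerance:
            mismatches[ce] = {"batch": b, "database": d}
    return mismatches


# Invented example: ce02 ran a job that never reached the database.
batch = [("ce01", 3600), ("ce01", 1800), ("ce02", 7200)]
db = [("ce01", 3600), ("ce01", 1800)]
print(compare_accounting(batch, db))
```

A check of this kind would catch the case raised in the chat where one CE does not publish correctly, since that CE would appear on the batch side only.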
JG: On monitoring – is there a reliable site dashboard? Probes and monitoring from Nagios is fine but something that shows the sites the experiment experience.
JT: Experiments also have downtimes and these things are not always fed back
??: There was an update to the dashboard for sites. It is still in production. There are still problems in that the information is not complete but Julia’s group still supports it.
JT: Does anyone from the experiments want to comment on the completeness of the information shown?
Nobody from dashboard so … perhaps Julia could report back on the status of “Site view”.
Some missing stakeholders. Would really want more site input. Nov 28th is the deadline for first drafts so for those who want to contribute please do so now!
Accounting (John Gordon)
JT: What gets checked.
‘Existing EGI Infrastructure’ slide: As part of handshake the MON box can communicate with APEL repository informing it of the local information. By the portal stage the granularity at CE level is lost.
Messaging will move to Secure Stomp Manager.
JT: Current default if you go to portal is CPU time but it should be wall time. Can the default be changed?
JG: There is an EGI request tracker, please enter the request there.
JT: On the “others” slide “ambitions” are mentioned. Why without use cases are these presented? I am concerned about how the suggestions scale and the impact they have on resource centres. Process accounting is very expensive.
TF: Some of these were in the proposal to address business cases.
JG: Perhaps it can be implemented for certain areas like licensed software. Nothing to say these are mandatory and whether it is even technically feasible is on the assessment.
Site issues. There will be a new client and sites will need to move to EMI-APEL. It is already possible to move.
What LRMS support is needed? Any demand for Condor? Historically support for (S)GE has been weak.
 
Improvements in CERN Batch Accounting (Jerome Belleman)
You said you were uploading individual jobs to the database – you mean your database?
JB: Yes.
Perhaps you can test more frequent uploads to the central database.
TF: You mention the collectors were written for LSF – why write new ones and not release them. How does the system compare to other systems?
JB: The idea was to end up with something simpler.
We wanted to be able to account for grid and non-grid jobs on LSF. We were not aware of any packages that already answered this need.
MJ: Is this a CERN specific development or could it replace the current LSF parser?
Yes it could easily be reused because there are few CERN specifics. It uses an Oracle database. The only gLite part replaced was the collector on the CE, which extracts to Oracle.
JG: At the moment the publisher client is not geared up to take individual benchmarks directly for each WN; if it could, that would be very useful (it currently queries the BDII).
 
 
 
LUNCH
 
Middleware
EMI (Doina Cristina Aiftimiei)
JT: The issue about the dynamic scheduler (slide 7, 3rd bullet under CREAM)…
DCA: It was a typo in YAIM that needed to be fixed.
JG: Previously (last meeting) there were questions if anyone else needed LFC on Oracle. There was another site needing it. The problem mentioned was finding somebody to test as CERN did not have effort available.
MS: LHCb rely on the replication mechanism of Oracle.
Oliver Keeble: Noted FTS 2.2.5 had stayed in rollout for a long time and because it has a particular deployment scenario it can be handled differently. The FTS pilot provides production exposure.
JG: We found this with other components – people actively testing with the developers but not saying this as part of the staged rollout. VOMS_oracle may be in this position.
WMS is not in UMD and is only supported now with security patches.
HM: We need to do something now – picked up WMS from developers because issues affecting acceptance into EMI-1 were not relevant to CERN.
ML: Both CNAF and CERN have participated in the validation. Various issues were found that would have caused trouble for all the LHC VOs using the WMS (still all of them). The last annoying issue has been fixed in EMI and the release was recently certified. CNAF and CERN can use those repositories. But in general it is not advised that sites pick up things directly from EMI. Better for WMS to be going through staged rollout – that should not be a token effort given the work just mentioned. A UMD update was foreseen in the near future which will contain the WMS. This should deal with the issue of the WMS being supported only on SL4.
AA: It looks like many sites are taking things from EMI directly … we would like to make the pickup more official and talk with interested sites for official certification before release.
LdA: CNAF is using the EMI repository not the UMD one. My opinion is that the layer added by UMD is not useful. All the problems need to be identified before release.
MS: By design staged rollout in a project that has no infrastructure is not possible
AA: The discussion is too much project-bound. The idea is to make things as smooth as possible: to work with interested sites to help in the final stages of the certification to ensure high availability and quality of the middleware at the time of release. I’m not saying that staged rollout is useless but that it is perhaps done at the wrong point right now.
TF: The size and impact of staged rollout is complementary … more effort is needed in staged rollout.
JG: What held up WMS in UMD?
ML: It has been in staged rollout for some time and significant issues were found. Not every community will have the same issues. The processes are complementary. Staged rollout can profit from EMI conducting testing at a higher level as in the WMS case with CERN and CNAF. A shorter rollout validation will be needed.
 
JG: What is the glite-release-bundle?
It is the script used to prepare the UI.
 
gLite (Maria Alandes Pradillo)
Pending releases – it was noted that FTS will move straight to 2.2.8.
JG: On exceptions… what is the difference between CREAM 1.6.8 and the upcoming EMI CREAM with SGE fixes.
??: It is backporting.
 
 
VDT rpms (John for Alan of OSG)
No questions on the update.
 
WLCG Client Distribution (Oliver Keeble)
GT at CERN will continue to populate the Application Area’s AFS space as we move from gLite to EMI releases.
 
HEPiX
HEPiX Fall 2011 Highlights (Michel Jouvin)
All the usual threads with one new one emerging on business continuity.
JT: How should I interpret the word “business” here?
MJ: Something to be understood. It is about service continuity and planning around disasters.
IPv6 – testbed activities, more sites invited to join even just to listen on the mailing list.
Virtualisation – more later in this meeting.
Storage – Main focus is to establish a benchmark on various file system technologies. Will also look at storage solutions from the cloud world.
Benchmarking – Work still being published on w3.hepix.org. Revival may be needed after SPEC release new benchmark version next year.
Next meetings: Spring 2012 Prague – 23rd-27th April. Fall 2012 Beijing 2nd half October.
JG: There is a good synergy between WLCG and HEPiX and it helps WLCG who would otherwise have to setup test infrastructures etc.
CPU Benchmarking: Present and Future (Helge Meinhard)
HEP-SPEC06 resulted from HEPiX WG studies and was adopted by WLCG. This helped move to a performance count not just a box count in WLCG, it also gave a good way to express requirements and site pledges.
HS06 now well established and adopted beyond HEP. Vendors now use the benchmark.
JT: Adoption outside of HEP – is this on the grid or beyond?
HM: They are mostly grid communities using it outside of HEP.
However, the defined conditions are less correct as experiments move to 64-bit. It does not work well for whole node scheduling (SPEC rate may be better here).
The working group currently concludes that HS06 is no longer very accurate but it is probably “good enough”.
Expect SPECcpu v6 sometime next year and then we will need to reassess the benchmark for HEP anyway and this will require sites and experiments to contribute effort as back in 2008.
Stephen Burke: If it's a constant 15% scale factor does it make any difference for procurement?
HM: No.
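Stephen Burke's point can be checked with a small sketch: if every bid's benchmark score is off by the same constant factor, the ranking of bids by price per HS06 does not change. The vendor names, prices and scores below are invented for illustration, and ranking purely on price per HS06 is an assumed, simplified procurement criterion.

```python
def rank_bids(bids, scale=1.0):
    """Rank bids by price per unit of (scaled) benchmark score, cheapest first.

    `bids` maps a hypothetical vendor name to (price, hs06_score).
    `scale` is a constant correction applied to every score alike.
    """
    cost = {name: price / (score * scale) for name, (price, score) in bids.items()}
    return sorted(cost, key=cost.get)


# Invented bids: (price, total HS06 offered).
bids = {
    "vendor_a": (100_000, 9_000),
    "vendor_b": (120_000, 11_500),
    "vendor_c": (90_000, 7_800),
}
print(rank_bids(bids))              # ranking with scores as declared
print(rank_bids(bids, scale=0.85))  # ranking with every score reduced by 15%
```

The constant factor changes the absolute price per HS06 of every bid by the same ratio, so only the absolute capacity figure is affected and not the order of the bids.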
JT: At last CHEP talks suggested we were missing 80%+ of the CPU capability due to using C++. So when do the experiments start to take notice of that? That is a factor of 3… whereas you are talking about a 20% effect.
JG: Some rewriting is going on to take advantage of multi-threading.
IU: All appreciate the HS06 benchmark. Real jobs are bound by disk i/o. HS06 does not show this…
JG: Could run a job with i/o or run an i/o benchmark too.
HM: In the interests of sites to ask experiments what are the requirements for disk, i/o etc. to make sure the pledged resources can be used effectively.
On the accounting. Can the benchmark be run with benchmarks?
JG: Benchmarks require specific conditions. Some nodes can also vary their performance based on demand. In the UK there have been comparisons made between effective HS06 based on seconds per ATLAS event vs declared HS06.
MJ: Steve Traylen pointed out that some clients had problems with sub-clusters… but if those clusters worked we could do what people mention here.
JT: It may be that the CREAM scripts were never made to work.
JG: Perhaps ask Steve Traylen back to explain why it does not work.
JT: What is it about this that affects ATLAS?
IU: The reason for asking is that ATLAS maintains its own accounting and we do not always believe or understand the site declared HS06 values.
JG: For larger than average memory requirements perhaps CPU time gets wasted in other ways. The other issue of passing job parameters just mentioned should also be revisited.
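The UK comparison JG mentions, effective HS06 derived from seconds per ATLAS event against the declared value, might look like the sketch below. The calibration approach (scaling from a reference node with a trusted HS06 per core and a measured time per event), the function names and all numbers are illustrative assumptions and not the method actually used in the UK.

```python
def effective_hs06(ref_hs06_per_core, ref_sec_per_event, site_sec_per_event):
    """Estimate HS06 per core from event-processing time.

    Assumes throughput scales linearly with HS06: a node that needs half the
    time per event is taken to deliver twice the HS06 per core.
    """
    if site_sec_per_event <= 0 or ref_sec_per_event <= 0:
        raise ValueError("seconds per event must be positive")
    return ref_hs06_per_core * ref_sec_per_event / site_sec_per_event


def declared_vs_effective(declared_hs06_per_core, effective):
    """Ratio above 1 means the site declares more HS06 than the jobs observe."""
    return declared_hs06_per_core / effective


# Hypothetical numbers for illustration only.
eff = effective_hs06(ref_hs06_per_core=10.0, ref_sec_per_event=20.0,
                     site_sec_per_event=25.0)
ratio = declared_vs_effective(declared_hs06_per_core=10.0, effective=eff)
print(f"effective HS06/core: {eff:.2f}, declared/effective: {ratio:.2f}")
```

A ratio far from 1 would be a reason to question the declared value, although, as noted in the discussion, disk i/o and memory effects can also slow real jobs without the CPU benchmark being wrong.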
 
Virtualisation (Tony Cass)
Image generation policy – done
Contextualisation standard – done
Catalogue and exchange tools - No members of the working group have a mandate to support any code for use at other sites or as a central catalogue.
JG: So are you saying this is an R&D project that wants to move on?
 
JT: A comment… I had an interesting discussion with people working on this and learnt that the experiments should be really interested in this so that they get running the image that they want.
JT: A low level question. You said contextualisation was interfacing with the site syslog?
TC: Sites have to maintain logs for 6 months. I do not know how 150 sites around the world have to store things so I need to allow you to route off various information to different areas where it gains more permanent existence. For virtual machines you may have to be able to see the syslog trace, especially if the node is running a multi-user pilot job.
IB: On the mandate, we have to agree this is what we want and then we can ask people to mandate it. At the moment we have not agreed.
There ensued questions and discussion on CERNVM, which were not recorded as the minute taker had to leave.
 
EVO chat:
[08:58:27] Oliver Keeble joined
[09:02:42] Fabio Hernandez joined
[09:03:48] Pablo Fernandez joined
[09:04:20] Stephen Burke Did the sound go off or is it just me?
[09:04:26] Tiziana Ferrari I can't hear
[09:04:31] Davide Salomoni me neither
[09:05:02] Pablo Fernandez neither do I
[09:05:59] Pablo Fernandez left
[09:06:05] Jeremy Coles John was setting up.
[09:06:12] David Collados joined
[09:06:34] Tiziana Ferrari left
[09:07:11] Stephen Jones left
[09:07:23] Tiziana Ferrari joined
[09:08:06] Stephen Jones joined
[09:12:05] Mario David joined
[09:14:01] Christopher Jung joined
[09:14:28] Christopher Jung left
[09:14:51] Alberto Aimar joined
[09:15:12] Philippe Charpentier joined
[09:15:18] Pablo Fernandez joined
[09:27:05] Paolo Veronesi joined
[09:31:56] Stefan Roiser joined
[09:32:09] Stephen Burke left
[09:36:12] luciano gaido joined
[09:36:20] Alessandra Forti joined
[09:40:53] Alessandra Forti all groups affect sites
[09:41:25] David Collados left
[09:41:27] Alessandra Forti indeed
[09:48:16] Andrea Ceccanti joined
[09:51:21] Alberto Aimar left
[09:51:40] Alberto Aimar joined
[09:52:00] Oliver Keeble left
[09:52:54] Oliver Keeble joined
[09:52:58] Tim Bell joined
[10:00:31] Andrew McNab joined
[10:04:20] Oxana Smirnova joined
[10:15:44] Philippe Charpentier left
[10:15:55] Philippe Charpentier joined
[10:16:23] Philippe Charpentier The video has gone...
[10:20:36] Jeremy Coles Hi Philippe - I can still see the video is being broadcast. Did it return for you?
[10:21:42] Philippe Charpentier no, but doesn't really matter... I can hear and follow the slides from Indico 
[10:21:50] Philippe Charpentier saves bandwidth actually 
[10:22:53] Alessandra Forti there is an alert from nagios
[10:23:00] Alessandra Forti for that
[10:23:22] Alessandra Forti what isn't there is the comparison between batch system and local database
[10:23:47] Alessandra Forti if one of the CE doesn't publish correctly it's difficult to know
[10:24:43] Jeremy Coles Hi Alessandra - note that there is no display at CERN showing your comments. Audio is open though.
[10:27:10] Stephen Burke joined
[10:32:28] Fabio Hernandez left
[10:40:35] Andrea Cristofori joined
[10:42:49] Alberto Aimar left
[10:46:56] Stefan Roiser left
[10:52:44] Tim Bell left
[10:56:36] Philippe Charpentier left
[11:19:52] Paolo Veronesi left
[11:22:57] Andrew McNab left
[11:23:00] Jeremy Coles Stopping for lunch. Restart at 14:00 CET
[11:23:04] luciano gaido left
[12:51:33] David Collados joined
[12:51:42] luciano gaido joined
[12:51:44] Andrea Ceccanti left
[13:00:44] Paolo Veronesi joined
[13:01:35] Andrea Ceccanti joined
[13:02:05] Jeremy Coles John is just trying to rejoin CERN to EVO.
[13:02:34] Tony Cass joined
[13:06:03] Ulrich Schwickerath joined
[13:06:13] Stephen Burke Either CERN is in darkness or the video is broken ...
[13:06:33] Jeremy Coles The video works in my session.
[13:07:04] Jeremy Coles Does anyone else have a problem with the video?
[13:07:17] Ulrich Schwickerath works fine for me
[13:07:25] Jeremy Coles Thanks for confirming.
[13:07:51] Stephen Burke I relaunched it - it seems to be OK now
[13:08:18] Tiziana Ferrari left
[13:08:36] Tiziana Ferrari joined
[13:10:06] Pablo Fernandez left
[13:10:14] Pablo Fernandez joined
[13:12:46] luciano gaido left
[13:15:14] luciano gaido joined
[13:16:05] Mario David SARA
[13:16:39] Stephen Burke I'm pretty sure that RAL runs LFC on Oracle ...
[13:17:17] Mario David so... go ahead and offer 
[13:17:37] Stephen Burke Not up to me!
[13:18:05] Mario David was
[13:25:36] Stephen Jones left
[13:26:58] Stephen Jones joined
[13:37:10] Stephen Jones left
[13:39:50] Stephen Jones joined
[13:52:50] Ulrich Schwickerath left
[13:53:14] Andrea Ceccanti left
[13:56:44] Tiziana Ferrari I have to leave the meeting to join another one. Cheers.
[13:56:50] Tiziana Ferrari left
[14:07:17] Claudio Grandi joined
[14:07:22] Christoph Grab joined
[14:13:05] Stephen Burke If it's a constant 15% scale factor does it make any difference for procurement?
[14:22:59] Stephen Burke Code will maybe be rewritten in 2013 (long shutdown)
[14:23:13] Stephen Jones What language? ADA?
[14:24:12] Stephen Burke Still C++ I would think, but more emphasis on performance
[14:25:07] Stephen Burke Also whole-node jobs will need a new architecture
[14:32:13] Stephen Burke I don't have a mic on this computer ...
[14:32:45] Stephen Burke The WMS restriction is a separate queue per subcluster
[14:33:04] Stephen Burke What glite-cluster does is make it easier to have many subclusters
[14:43:31] Christoph Grab left
[14:44:26] Mario David left
[14:46:44] Andrea Cristofori left
    • 10:00 10:30
      Introduction
      Convener: Dr John Gordon (STFC - Science & Technology Facilities Council (GB))
    • 10:30 11:30
      Technical Evolution Groups
      • 10:30
        Workload Management 20m
        Speakers: Davide Salomoni (Universita e INFN (IT)), Dr Torre Wenaus (Brookhaven National Laboratory (US))
      • 10:50
        Data Management 20m
        Speakers: Dr Brian Bockelman (University of Nebraska), Dirk Duellmann (CERN)
      • 11:10
        Operational Tools 20m
        Speakers: Jeff Templon (NIKHEF (NL)), Dr Maria Girone (CERN)
    • 11:30 12:00
      Accounting
      Convener: Dr John Gordon (STFC - Science & Technology Facilities Council (GB))
      CERN
    • 12:00 14:00
      Lunch 2h
    • 14:00 15:05
      Middleware
    • 15:05 16:05
      HEPiX
      • 15:05
        Summary of 20th Anniversary Meeting 30m
        Speaker: Michel Jouvin (Universite de Paris-Sud 11 (FR))
      • 15:35
        Benchmarking 15m
        Speaker: Dr Helge Meinhard (CERN)
      • 15:50
        Virtualisation 15m
        Speaker: Tony Cass (CERN)