GDB

Europe/Zurich
30/7-018 - Kjell Johnsen Auditorium (CERN)

Description
Monthly meeting of the WLCG Grid Deployment Board
GDB, 21st March 2012
 
Election. The voting was close and the results were announced at 11:15: Michel Jouvin is the new chair.
 
Introduction (John Gordon)
Since the previous meeting: ISGC, 26th Feb - 2nd March 2012; OGF, 11th-14th March; Workshop on Science Applications and Infrastructure in Clouds and Grids, 14th-15th March.
Next meetings: 18th April, 9th May, 13th June, 11th July. A room is booked on Tuesdays for WGs.
Next week is the EGI Technical Forum.
HEPiX, 23rd-27th April, Prague.
EMI AHM, 8th-10th May.
WLCG workshop, 19th-20th May.
CHEP 2012, 21st-25th May.
 
Main future issues according to John: Can computing continue to scale with LHC? Where will the middleware come from?
 
Vidyo Demonstration (Joao Correia Fernandes)
The Vidyo tool has three parts.
1) Video portal (web based, Flash). All CERN users get an account by default. He will go through the tool.
Default view: My Room. Everyone has a personal room with no limit on the number of participants.
The search tool allows you to look for users. If the name is grey the user is offline; if green, online. To see everyone, put an asterisk in the search. The square symbols represent the meetings that are running. You can click on a room to join it. Rooms/meetings can be added to the My Contacts list.
2) The video client
 
Sharing – 4th button. Can double click window for larger video display.
Under configuration:
Status. Can see participants. The router being used.
Show conference status shows who is broadcasting. If CPU usage rises to 90% you may start to see packet loss.
Network: the important thing here is the firewall. Typical behaviour when the firewall lets packets out but not in is that the call gets dropped, and the error message is not very useful (a minimal connectivity check is sketched below).
Video: It is possible to specify the preference for bandwidth usage.
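The asymmetric-firewall behaviour mentioned above (packets allowed out but replies blocked) is hard to diagnose from the Vidyo error alone. Below is a minimal, hypothetical Python sketch of a UDP round-trip check; the host and port are placeholders, not actual Vidyo infrastructure, and a timeout simply suggests that replies are not getting back in.

    import socket

    def udp_round_trip(host, port, payload=b"ping", timeout=3.0):
        """Send one UDP datagram and wait for any reply.

        A one-way firewall (outbound allowed, inbound blocked) typically
        shows up as a timeout here rather than an explicit error.
        """
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, (host, port))
            data, _ = sock.recvfrom(1024)
            return len(data) > 0
        except socket.timeout:
            return False
        finally:
            sock.close()

    # Hypothetical echo endpoint; replace with a service you control.
    if udp_round_trip("udp-echo.example.org", 7):
        print("reply received: two-way UDP path looks fine")
    else:
        print("no reply: possible outbound-only firewall (or host down)")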
3) Moderator part
Allows view of those joined. Can mute them, remove their video, send email invitations, set a PIN etc.
Now the Indico part.
 
Q: How do you connect via SIP?
J: with which client?
The information will be on this page: service-vidyo.web.cern.ch/node/17. A new method will be available in April, making SIP the phone connection.
You have said you need version X of the client to connect to version X of the server. At the moment it is only a Q&A check, and matching versions is the only way to be sure for production.
When 2.4 is released, most of the missing features, such as chat, will be included. All the clients will be released. We will publish a reminder for people to upgrade their clients, but you do not have to upgrade. We will have extra support for the transition.
 
Updating the CERN Computer Centre Infrastructure (Ian Bird)
 
Extension of the Tier-0 will be in Budapest.
JT: Is EGI part of the project?
IB: One of the goals is to integrate this into the eInfrastructures. The EGI organisation (not project) is part of this. Whether they see themselves as a provider is not clear but they need to be there for the roadmap.
I’m guessing that it may be different for public clouds not the commercial clouds. The issue with commercial clouds has been on the data transfer.
JG: One provider was changing the rules – perhaps Amazon – indicating that transferring data will not be as expensive.
Have you met the challenge you set yourself on using external resources?
IB: The point was not to throw away things where we have a lot of investment – such as Lemon.
JG: You have so many boxes there can you resist doing more development?
IB: Yes. A number of people have a vested interest in making this work. So far there have been few situations where CERN has needed to do something special/different. There was one example from another community – perhaps support for RH – but the solution goes back into what everyone can get.
??: Any examples of hardware…?
IB: It has to be scientific work. There will certainly be other things like databases etc. Ultimately we’ll need to use it for business continuity. Currently looking at the model on how we use it. We want to avoid special solutions for each experiment. It is part of the Tier-0. We may decide it is the CAF or some other specific area but that is to be decided. It will run with a CERN IP address.
 
Grid Engine Scientific Sites (Philippe Olivero)
No questions/comments
 
Experiment Operations
ALICE operations (Latchezar Betev)
No audio from CERN for most of the talk and then insufficient volume to hear the questions.
Q: Do you have enough tapes?
LB: Yes we do. We have written only raw data and have replicated only good raw data. CCIN2P3 is full but not elsewhere.
JG: CPU efficiency was poor for ALICE.
LB: The jobs are I/O heavy and then we make cuts. This type of job is not inefficient; it simply does not use a lot of CPU. The jobs are always sent to where the data is held.
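For context on the efficiency remark: CPU efficiency as usually reported is CPU time divided by wall-clock time times the allocated cores, so an I/O-bound job can be perfectly healthy yet score low. A small illustrative calculation (the numbers are invented, not ALICE figures):

    def cpu_efficiency(cpu_seconds, wall_seconds, cores=1):
        """CPU efficiency = CPU time / (wall-clock time * allocated cores)."""
        return cpu_seconds / (wall_seconds * cores)

    # Hypothetical I/O-heavy job: 2 h wall clock on one core,
    # but only 40 min of CPU time while it waits on storage.
    eff = cpu_efficiency(cpu_seconds=40 * 60, wall_seconds=2 * 3600, cores=1)
    print(f"CPU efficiency: {eff:.0%}")  # ~33%: low, but the job is not wasted work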
 
 
LHCb (Philippe Charpentier)
JG: 2013 is not what you mean by….[audio cut]
PC: No, this is not the same as the reprocessing at the end of this year. The open selections will start in the spring of 2013.
 
JG: Slide 10, “Limit per process, not process group”: is this the difference between a process and a process group? Is that related to the batch system?
PC: Yes. Apparently on some batch systems one can set up a limit per process. This is good for us: if the application process is killed then the framework can pick up what happened. If the whole job is killed then we get no log, and this is not good for diagnostics.
JT: pvmem is used at Nikhef for Torque. It is the same as setting the ulimit: the process gets an out-of-memory signal. Otherwise the program gets something like ‘job killed by administrator’.
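To illustrate the per-process versus process-group distinction PC and JT describe, here is a minimal Python sketch (not any batch system's actual mechanism; the 4 GB figure is illustrative): a per-process limit such as RLIMIT_AS is enforced for each process separately, so a child that over-allocates fails on its own while the parent framework survives to record what happened.

    import resource
    import subprocess
    import sys

    LIMIT_BYTES = 4 * 1024**3  # illustrative 4 GB per-process virtual memory cap

    def set_per_process_limit():
        # RLIMIT_AS caps each process's address space individually; it is
        # inherited by children but enforced per process, unlike a limit
        # applied to the whole process group / job.
        resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))

    # The framework starts the application as a child with the limit set.
    # If the child over-allocates it fails (MemoryError, non-zero exit)
    # while this parent stays alive to collect logs and report the cause.
    proc = subprocess.run(
        [sys.executable, "-c", "x = bytearray(8 * 1024**3)"],  # tries to grab 8 GB
        preexec_fn=set_per_process_limit,
    )
    print("child exit code:", proc.returncode)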
Mario: A few weeks ago PC sent a message to us and EGI about low disk space for job working directories. I think this has highlighted the need for a wiki or other communication defining what is needed or expected by each LHC VO. Include in that the vmem just mentioned in the presentation. The VO ID card does not, for example, record the need for the memory increase. Some things are approved by the GDB but not propagated; there needs to be an area to go to for decisions made by the GDB, for example on configuring CVMFS or that all sites should deprecate LCG-CEs.
JG: This issue of TMP was defined a long time ago and has just been forgotten.
MM: But it is not just that. Sites are not always aware of the misconfigurations observed and commented on in the GDB. I suggest a wiki area for an update on requirements.
JG: The VO cards are one place to do this. What they are not doing is putting the things side by side (i.e. where experiments need things to be different).
MM: Each experiment would put its own requirements there. I am assuming each is unique and has something different.
JG: You want a place to see the status/decisions if you miss the GDB etc.?
MM: yes. There are things that are generic and also things that are specific. The experiments report problems from time to time.
JT: This should be in the VO ID card. Perhaps it is not always done. Having two places is not a good idea as they would get out of sync.
JG: The middleware is already recorded in the baseline pages. We could look at having a repository of GDB decisions.
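As an illustration of the kind of requirement being discussed (working-directory disk space), a pilot or job wrapper can check free space before starting the payload. The 20 GB threshold below is a placeholder, not an agreed WLCG or VO figure:

    import os
    import shutil
    import sys

    # Placeholder minimum; the real figure should come from the VO ID card
    # or whichever requirements page the GDB agrees on.
    MIN_FREE_GB = 20

    workdir = os.environ.get("TMPDIR", ".")
    free_gb = shutil.disk_usage(workdir).free / 1024**3

    if free_gb < MIN_FREE_GB:
        sys.exit(f"only {free_gb:.1f} GB free in {workdir}; "
                 f"below the documented {MIN_FREE_GB} GB requirement")
    print(f"{free_gb:.1f} GB free in {workdir}: OK")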
JT: Noticed around 1st March that LHCb workload changed.
PC: This is the restriping I mentioned.
 
CMS (Stephen Gowdy)
Audio from CERN insufficient to catch questions raised during talks.
JG: Intrigued by Tier-1 comment. Why do they not want to give you a whole node?
SG: I assume it is that if we don't schedule jobs on it then more of the resource is wasted.
??: A node can have a different number of cores. If a job requests 8 or 12 cores we can allocate that. The fact that it is not a whole node is not so relevant. The scheduler needs to put jobs where they can run.
SG: If it is a whole node then you take what is there.
??: For our T2 we only have 150 machines, so if these start being dedicated then many jobs will get queued while waiting for the final cores to free up.
JG: Would be interesting to see if the assertions are true by letting the experiments try it.
??: This was discussed in the WM TEG and a recommendation will be in the report. 64-core machines have been tested: at BNL the results were not good while at KIT they were fine. So dedicating machines depends very much on the sort of site.
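A toy calculation of the trade-off being debated (the remaining-time figures are invented, not site data): to start a whole-node job by draining a node, every core that frees up early sits idle until the last running job on that node finishes, whereas slot-level scheduling keeps those cores busy at the cost of a less predictable start time for the multi-core job.

    # Remaining run time (hours) of the jobs currently occupying one node.
    remaining_hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]

    drain_time = max(remaining_hours)
    idle_core_hours = sum(drain_time - t for t in remaining_hours)

    print(f"node free after {drain_time:.1f} h; "
          f"{idle_core_hours:.1f} core-hours idled during the drain")
    # With slot-level scheduling the 8-core request simply waits in the queue
    # while freed cores keep running single-core work, so nothing is idled
    # deliberately.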
SC: In the slide you say ATLAS will use the glide-in WMS. Actually at the moment we are investigating but have no firm plans.
JG: How does it work with glide-in?
CG: It uses Condor-G to do the submission to the CE. Condor-G supports them all.
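For readers unfamiliar with the Condor-G route mentioned here: submission goes through HTCondor's grid universe, with the target CE named in a grid_resource line. The sketch below writes a minimal submit description; the CE endpoint, queue and the exact grid_resource string for a given CE type are assumptions to be checked against the HTCondor documentation for the deployed version.

    from textwrap import dedent

    # Hypothetical CREAM CE endpoint and queue; the grid_resource format
    # differs per CE type (CREAM, ARC/NorduGrid, GRAM, ...).
    GRID_RESOURCE = "cream https://ce.example.org:8443/ce-cream/services/CREAM2 pbs long"

    submit_description = dedent(f"""\
        universe      = grid
        grid_resource = {GRID_RESOURCE}
        executable    = run_payload.sh
        output        = job.out
        error         = job.err
        log           = job.log
        queue
        """)

    with open("gridjob.sub", "w") as f:
        f.write(submit_description)
    print("wrote gridjob.sub; submit with 'condor_submit gridjob.sub'")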
 
ATLAS (Alessandro Di Girolamo)
JG: Do you see the same problem as LHCb with process group?
AG: We don’t go into that level. What we are asking is to set the limit per job.
[Unable to hear questions] On Slide 13 and use of the VOMS free attribute.
ML: This is one of the issues of storage element mapping that we tried to discuss in the security TEG where data management is concerned. Perhaps in the coming weeks we can propose a solution, but it is unlikely we can make a big change just like that; probably a few months at the earliest. EOS would be a good candidate. We have a lot of legacy to remain compatible with.
 
JG: One caution: if you automate tickets to sites I would be wary.
AG: Tickets are done internally not automatically.
 
??: As one of the LFC sites coming along, I wonder when the date for the change will be?
AG: Not sure. We have just finished France and it may be BNL next. We were thinking of trying a smaller site next.
 
JT: How critical is it for the memory limit to be 4 GB rather than 3.8 GB?
AG: We see jobs at 3.5 GB getting killed.
JT: If 3.8 GB is fine then that helps; otherwise we need to make some big changes.
??: If the OS can’t deliver the 4GB then okay….
[Could not hear the response]
??: Was this about the vmem or the swap? I did not know it was connected with the scheduler.
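Since much of this exchange hinges on whether the vmem limit is enforced per process or for the whole job, here is a sketch of the job-wide variant using a cgroup-v1 memory controller; the cgroup path and the 4 GB value are illustrative assumptions, and in practice the batch system itself would create and manage the group.

    import os

    JOB_CGROUP = "/sys/fs/cgroup/memory/batch/job_12345"  # placeholder path
    LIMIT_BYTES = 4 * 1024**3                              # 4 GB, per the discussion

    def apply_job_memory_limit(cgroup_dir, limit_bytes):
        # memory.limit_in_bytes caps the whole process tree placed in the
        # group, in contrast to a per-process ulimit / RLIMIT_AS setting.
        with open(os.path.join(cgroup_dir, "memory.limit_in_bytes"), "w") as f:
            f.write(str(limit_bytes))
        # Add the current process (and hence its future children) to the group.
        with open(os.path.join(cgroup_dir, "tasks"), "w") as f:
            f.write(str(os.getpid()))

    apply_job_memory_limit(JOB_CGROUP, LIMIT_BYTES)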
 
Middleware
EMI News (Cristina Aiftimiei)
TF: VOMS_oracle will be tested by biomed and so will be included in the next staged rollout.
 
AG: We have observed some issues with the EMI WMS release. It is not clear to me if these will be fixed in a future release.
CA: What is the problem you mention?
AG: Two problems. One impacted the software installation: exit code 0 from the CE even if not successful. The other related to the ARC CE.
CA: The ARC CE was discussed yesterday. The product team are working to fix this problem as soon as possible in the next update.
MS: Is the correct way to address LCG-CE issues to use a CREAM-CE?
CA: Yes.
ML: OSG CEs are also affected in the same way. There is a flurry of bugs. The most important relates to ARC CE changes at some NorduGrid sites. This exposed a bug in a Condor component that is used for LCG-CEs and OSG CEs. At CNAF a fix has been applied and, to my knowledge, it will be officially released shortly. It was then discovered that submission to ARC CEs as a whole has various problems; the previous behaviour apparently worked by chance! Few ARC CEs are used through the WMS. The WMS and ARC developers are now discussing the issue and the fix may require work on both sides. The other side of Alessandro's comment: jobs that are not sent to CREAM still end with exit status 0, so the installation manager cannot rely on the exit status. The ticket on this has been opened and accepted. Fortunately it was not a showstopper.
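Until the exit-status bug is fixed, one defensive pattern (a generic sketch, not the installation manager's actual code; the file name is an assumption) is for the payload to record its own result and for the framework to trust that record rather than the CE-reported exit code:

    import json
    import os

    STATUS_FILE = "payload_status.json"  # hypothetical file written by the payload

    def payload_succeeded(sandbox_dir):
        """Trust the payload's own status record, not the CE exit code,
        since the CE may report 0 even when the job never ran."""
        path = os.path.join(sandbox_dir, STATUS_FILE)
        if not os.path.exists(path):
            return False  # the payload probably never started
        with open(path) as f:
            record = json.load(f)
        return record.get("exit_code") == 0

    # The payload wrapper would end with something like
    #   json.dump({"exit_code": rc}, open(STATUS_FILE, "w"))
    # and the framework calls payload_succeeded() on the retrieved output
    # sandbox instead of checking the status reported by the CE.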
 
WLCG recommendations (Markus Schulz)
SC: On SL6 and EMI-2: some sites want to move directly from gLite to EMI-2. The same reasoning applies to the experiments too; ATLAS for example run a fair number of clients at CERN to avoid the double hop. It is good to have the agreement. However, there are some features only in EMI-1 that would be good to have, for example the LFC. There is one data management rpm: could that be taken into a special gLite update?
MS: The current gLite 3.2 release is in a reasonable state. If a specific thing is needed for the VO boxes, we need to look at whether a full-blown gLite release is needed or just an rpm.
JG: A comment was made about two complete installs. Does going from EMI-1 to EMI-2 require this?
MS: Well, a reinstall for SL6 is needed. However, we have to see how quickly the experiments move; EMI-2 on SL6 has not yet been released. Following the recent computing coordinator meeting, we should also look at running the SL5 code of the experiments on SL6.
SC: Athena is running on SL6 but there is also the full job environment to test.
MS: And we may not want to move to full production during the run.
 
New chair selected. There will be negotiation to see who chairs in April.
JT: Thank you John for chairing the GDB for many years.
 
Meeting ended: 16:50.
 
Those connected to Vidyo:
Oxana Smirnova
T. Ferrari (EGI.eu)
Gabriel Stoicea
Tiju Idiculla
Alberto Aimar
Cipfiz
Denise Heagerty
Doina Cristina Aiftimiei
Christophorus Grab
Maria Dimou
Helene Cordier
Jose Hernandez Calama
Jeremy Coles
Luca
Wahid
Romain Wartel
Stephen Burke
Tony Cass
Joao CF
I Ueda
Philippe Charpentier
+33478930880
Mario David
John Gordon
Anders Waananen
Milos Lokajicek
Alberto Pace
 
 
 
 
 
    • 10:00 10:30
      Election of GDB Chair
      Convener: Ian Bird (CERN)
    • 10:30 10:45
      Introduction
      Convener: Dr John Gordon (STFC - Science & Technology Facilities Council (GB))
      slides
    • 10:45 11:15
      VIDYO

      A demonstration of the features of Vidyo

      Convener: Mr Joao Correia Fernandes (CERN)
    • 11:15 12:00
      Modernisation of the CERN Computing Infrastructure
      Convener: Ian Bird (CERN)
      slides
    • 12:00 12:15
      GESS Collaboration

      Grid Engine Scientific Sites Collaboration

      Convener: philippe olivero (CC-IN2P3)
      slides
    • 12:15 14:00
      Lunch 1h 45m
    • 14:00 16:00
      Experiment Operations
      • 14:00
        Alice 30m
        Speaker: Latchezar Betev (CERN)
        Slides
      • 14:30
        LHCb 30m
        Speaker: Philippe Charpentier (CERN)
        Slides
      • 15:00
        CMS 30m
        Speaker: Stephen Gowdy (CERN)
        Slides
      • 15:30
        ATLAS 30m
        Speaker: Alessandro Di Girolamo (CERN)
        Slides
    • 16:00 16:40
      Middleware
      • 16:00
        EMI News 20m
        Speaker: Doina Cristina Aiftimiei (Istituto Nazionale Fisica Nucleare (IT))
        Slides
      • 16:20
        WLCG Recommendations 20m
        Speaker: Dr Markus Schulz (CERN)
        Slides