GDB - Lyon

Name: GDB - Lyon
Start: 2011-03-09T08:00:00+01:00
End: 2011-03-09T18:00:00+01:00
Location: CC-IN2P3, Lyon

Description

Monthly Grid Deployment Board Meeting

March GDB – Lyon

Welcome (Pierre-Etienne Macchi)
Review of machine room status and plans.

Introduction (John Gordon)
The LHCOPN meeting was in Lyon on 10th-11th February.
Next meetings: April 6th, May 11th and June 8th. The July 13th meeting has been cancelled.
The 5th dCache workshop is on 16th-17th March. ISGC 21st-25th March. EGI UF 11th-14th April. Spring HEPiX 2nd-6th May.
News: R-GMA closed on 1st March. A few remaining sites need to migrate to gLite-APEL. Sites in that state will not appear in the March MB accounting report.
There are 13 sites not publishing CPU installed capacity. More than 59 sites are not publishing disk shares for LHC VOs – this is needed for the April RRB report.

CERN VM-FS Server Status (John Gordon for Ian Collier)
CERN IT support being finalized. Security audit report is ready. The replication and mirroring process is working. BNL making good progress.

CREAM integration status (Wojciech Lapka)
CREAM nagios probe equivalent to the LCG-CE. Full job-submission chain via WMS. GridView and ACE reports now available online.
The algorithms for ACE and GridView are the same, so the results should be the same. The only difference is that the new system uses ATP, which requires correct service definitions in GOCDB.
Small differences are seen for about 30 sites.
MS: Number of sites supporting?
WL: From our database
MS: You did not retrieve the data from the information system? The figure of 80 sites with only CREAM-CEs is wrong.
Slide 7: covers the period in February from the 9th. For T1 sites the difference seen is 3%.
Sites are requested to check that their services are correctly declared in GOCDB.
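A cross-check of this kind between the two availability reports could be sketched as follows. This is only a minimal illustration: the site names, the availability values and the 1% threshold are hypothetical and are not taken from the GridView or ACE reports.

```python
def availability_differences(gridview, ace, threshold=0.01):
    """Return {site: ace - gridview} for sites whose values differ by more than threshold."""
    diffs = {}
    # Only compare sites that appear in both reports.
    for site in sorted(set(gridview) & set(ace)):
        delta = ace[site] - gridview[site]
        if abs(delta) > threshold:
            diffs[site] = round(delta, 4)
    return diffs


# Hypothetical monthly availabilities (fraction of time available).
gridview = {"SITE-A": 0.98, "SITE-B": 0.95, "SITE-C": 0.90}
ace = {"SITE-A": 0.98, "SITE-B": 0.92, "SITE-C": 0.905}

flagged = availability_differences(gridview, ace)
print(flagged)
```

Sites flagged this way would be the first candidates for checking their service declarations in GOCDB.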
A new FCR mechanism is currently being tested by CMS.
SAM migration – new programmatic interface http://tinyurl.com/sam-migrate
Deadline to move is June/July 2011 as this is the end of support for SLC4.
JT: When can we start to use the new system?
WL: As soon as it is fully tested.
JG: New version of all tests is in production.
WL: The new PI is available.
JT: Is there a status page?
http://grid-monitoring.cern.ch/myegi
On slide 7, is ARC included?
WL: Yes – covers all CEs.
JG: Is your team responsible for myegi? In the UK we have a different set of tests in myegi… we worked with Nagios but had problems tailoring the set of tests.
PG: You said that CMS is testing the new FCR mechanism. Are the top-BDIIs now filtering with this?
WL: No. It is in a test BDII. I can check the filtering for you. I can also provide a link for you to see if you are out of production.

CREAM status (John Gordon)
Correct availability calculation by the end of March.
Totals: Unique CREAM CEs: 251. Sites supporting CREAM CEs 181.
WLCG sites with no CREAM CEs.
CG: There is a blocking issue of CREAM integration with SGE. Problem affects CMS. I understand somebody is working on it but with limited manpower.
In EGI support for Condor is completely out of scope.
JG: We have a large number of sites running CREAM, they are not necessarily supporting all VOs. The number supporting LHC VOs is about where you would expect.
ML: There are EGI sites only supporting CREAM.
JG: If sites are on the list without CREAM CEs please take action.
MJ: What is the plan for removing support for the LCG-CE?
ML: At least through the summer support will continue. SGE and Condor sites cannot move to CREAM. SGE is perhaps 20 sites. Condor is just a handful.
MJ: Do we take a decision to move LHC VOs off the LCG-CE? Do we encourage sites to move off the LCG-CE?
ML: Yes. Idea was to move. If you only run CREAM at the moment the availability calculation ignores the CE. The developers did not want to invest in fixing the old system.
JG: It is in the project’s interest to phase out the LCG-CE. Most sites support other VOs. Do we know of other VOs who need the LCG-CE at a site?
JT: Only one I know of is dzero. They use the glidein-WMS. A lot of non-HEP VOs tend to like CREAM because of direct submission.
JG: Are there other VOs outside HEP requiring LCG-CEs?
TF: I collected input from users and there were no explicit requests for the LCG-CE that I am aware of…

JG: So once the availability calculation and SGE components are sorted, we can suggest an end-of-service date for the LCG-CE. EGI and WLCG will have different tails for the migration.
ML: Can get rid of the bulk of LCG-CE in the near future.

Glexec & MUPJ (Maarten Litmaath)
Quick update – T0 and T1s mostly ok. Just TRIUMF on CREAM (not a problem with their LCG-CE WNs) and an issue for NDGF.
Nagios tests for “ops”. For CREAM https://samnag023.cern.ch/nagios/cgi-bin/status.cgi?.
ATLAS find “cmt” hanging now worked around. Target proxies not accepted by PanDA.
ML: Normally the pilot talks to PanDA. Perhaps there is some extra lookup … PanDA did not find FQAN. Possibly an issue with GridSite.
CMS. USCMS use glexec in production at a few OSG sites. More widely… issues with the location of glexec – suggested Claudio open tickets where a problem is encountered. Probably need to try one location then the other – OSG points at the binary.
CG: Wanted to recheck the tests before opening tickets. At most T1s… CEs need to support role=pilot.
JG: MB asked sites to enable glexec but did not request it be available for all LHC VOs.
LHCb – running nagios tests. CREAM not yet tested.
ALICE – integrating glexec into AliEn. User proxy handling still needs major development.
Relocatable code – already at Lyon. Pierre had to do some work. ML will follow up. Should not be hard. WN tarballs should be similar. The hard work is finding places in the config where env variables are hardcoded. Jeff will follow up too.
JG: Ian gave two deadlines. T1s to do this by the end of this month. T2s by the end of June. Are there any other showstoppers?
JT: What you said is that it should not be expected to work at T2s because there has not been a push. It should be configured to work easily.
ML: The push is to get the T2s to install – it is not part of the standard Tier-2s. At UK T2s where tested… issue was ARGUS/SCAS was not permitting any jobs besides the pilot jobs themselves. The pilot is the one that should be allowed to call glexec and then it should run any job for the VO. Misunderstanding… so ARGUS only allowed pilot as well. Also, some confusion about constructing the job exactly. … sandboxing the payload correctly etc.
JG: With direct submissions to CREAM do we need pilots so much?
ML: Yes. It is for getting the job into the system and accounting in a standard way.
JG:
Duncan: Different mappings ….
ML: ARGUS has that by design. It decides the mapping. This is the account, the groups etc. On headnodes you do not need to share. The mapping is going to be the same everywhere. The policy language is quite flexible. Standard use cases can be represented. Can say… if WN is this then I want that. Based on where the request comes from you can have a different decision.
JG: Different clusters with different pool accounts.
Packaging and certifying in EMI 1. ARGUS will be one of the first node types we can just deploy from EMI 1.
MS: Do we want to wait for EGI to approve EMI 1 or just deploy?
JG: One scenario considered before, continue with gLite 3.2 and skip EMI1.
MS: CREAM CE 1.7 is a stand alone mode with little state. Reinstallation is not such an issue and here we get something.
JG: Just heard that only security patches from October 2011. Contentious because some 3.2 components only just being released. When does 12 months start – June or April for example.
MS: Also important software in gLite 3.2 that is not part of EMI. We did not need to make it explicit in the past. We need to find a strategy to deal with this. Many partners mixed funding but EMI takes position that they have full control.
TF: Waiting to see the release notes to understand the priorities. Consultation needs to start with NGIs.
JG: Just making point that this is the first time we’ll go through the process.
JT: We have no confidence of things working until we have it in staged rollout.
Is there anyone looking at the milestones between the 3-5 projects to make sure things fit together? I have not seen a timeline covering this.
MS: Your centre probably uses s/w from 40 providers….
The magic word is trust. This is the first time to see the process working for EMI….
MS: If s/w cannot be replaced then it needs to be supported.
JG: gLite as a consortium still has WLCG involvement.
ML: New flag can be set to receive such tests. Currently have issue – test should work but also CE should publish support. At the moment testing many services that will never support it.
JG: Need to know which tests should be working.
ML: A question. How are we going to push the T2 sites? Not keen on opening 200 tickets and following up.
JG: That’s why we at least need some monitoring.
JT: We have ops… is that EGI. Could have wlcg-ops and make it critical.
JT: EGI discussing with ACE if they can have a different profile from WLCG.

LHC Open Network Environment (LHCONE) – John Shade
The website is http://lhcone.net
MS: Also Tier-1 links (slide 5).
The proposal. Build on exchange points – exchange points built in carrier-neutral facilities so that any connector can connect with their own fiber or using circuits provided by any telecom provider.
MS: What is a distributed exchange point?
JS: Some exchange points spread over many facilities. Run by one physical organization but does not have to be in one location. The distinction is important politically for some.
FF: What is the process for deciding these points?
JS: GÉANT and SURFnet in Europe.
FF: But in Latin America, Asia Pacific etc. there is no organization.
Next steps: get feedback/approval from GDB and build a prototype plus refine the architecture document.
MS: For the monitoring, will it be sufficiently detailed to allow FTS to use it?
Dashboard for OPN in progress. The idea is to help identify if the network might be at fault. It uses perfSONAR. Can imagine that the data would be detailed enough to help with the scheduling. Monitoring traffic patterns is going to be important.
JG: Question about who has one of these points. What happens to those countries not involved. Use the commodity internet?
JS: That is a possibility.
FF: Who is leading LHCONE?
JS: At the moment the same crowd as LHCOPN. Need to bring in extra players from GÉANT, ESnet etc. It is open for participation.

File system stress testing (Xavier Canehan for Yannick Perret)
Choosing hardware uses spec but does not consider disks, connections etc.
JT: Examples – increase the number of jobs, increase the load… what would you expect?
XC: Expected to see a very slight slope.
Conclusions: Tests are complex (many variables) and have costs. Several constraints (e.g. budget) mean we might have to change the way we select hardware.
JT: One option not covered is not allowing all users. LHCb paper showed no cluster file systems could handle their application. Could tell LHCb to fix their environment. They raised many tickets on sites recently… the sites should say no…
HM: True that HEPSPEC06 is CPU only benchmark. Also, it is not fair to have CVMFS which is read-only distribution system marked as a distributed file system.
JG: For most use cases AFS is read-only.

ALICE (Maarten Litmaath)
MS: You leave no software after the job finishes?
ML: No, jobs download to the working directory and it is cleaned up appropriately. ALICE jobs run for an hour or more and the mechanism works quite well and it is clean – no cache to pollute or get corrupted.
A big date for ALICE is the Quark Matter conference in Annecy, 23rd-28th May.

JG: Slide 11. Why were xrootd tests not integrated earlier?
ML: Xrootd storage elements did not get widely used until recently so the tests were not included in the availability calculation.

LHCb Operations Report (Stefan Roiser)
Due to a shortage of disk space, LHCb is re-visiting its computing model.
New tools in DIRAC to help with checking data consistency.
JT: Sounds like due to SRM problem the registration does not work properly. Try to delete it and …
RT: We don’t know but need to find out if this is SRM implementation issue.
JG: Did other VOs see a problem with the version of CREAM causing LHCb problems?
ML: No. The CREAM CE issue was related to the DN splitting. A CREAM rpm update was needed.
The runtime environment problem was due to the number of file operations during the setup step – AFS then occasionally times out.
PG: At IN2P3 we resolved this on 24-core machines by decreasing the number of job slots.
SR: On fairshares we see a sawtooth pattern; it would be nice to increase the fairshare to get through jobs quicker.
JT: Description from users is a bit like somebody who reads the guidebook of a city but has never been there. Perhaps I should give a future GDB talk on this topic.
In Maui it is implemented through a priority function. It looks at usage in groups vs what was assigned. Some sites use relative fairshares to normalize to what was pledged. Others use absolute ones, such as 10% of the total. Anyway, at any one point there is only one VO winning… so all jobs that start at that point are for that VO. LHCb jobs tend to go on for a day… ATLAS jobs exit quickly.
MJ: Also the fairshare period is the point at which quotas get reset.
JT: For us median of two days with a damping factor has worked. If all VOs submitting a steady flow of jobs then it works perfectly!
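The fairshare behaviour described here can be illustrated with a small sketch. It is not Maui's actual code or configuration: the function names, the two usage windows, the decay value of 0.5 and the VO targets are all hypothetical. It only shows the general idea of comparing decayed historical usage against a target share, with the VO furthest below its target "winning" at that moment.

```python
def decayed_shares(usage_windows, decay):
    """Fraction of decayed usage per VO; usage_windows is ordered most recent first."""
    totals = {}
    for age, window in enumerate(usage_windows):
        weight = decay ** age  # older windows count for less (the damping factor)
        for vo, cpu_hours in window.items():
            totals[vo] = totals.get(vo, 0.0) + weight * cpu_hours
    grand_total = sum(totals.values())
    return {vo: (t / grand_total if grand_total else 0.0) for vo, t in totals.items()}


def fairshare_priority(targets, usage_windows, decay=0.5):
    """Priority component per VO: target share minus decayed usage share."""
    shares = decayed_shares(usage_windows, decay)
    return {vo: targets[vo] - shares.get(vo, 0.0) for vo in targets}


# Hypothetical targets (fractions of the cluster) and CPU hours per window.
targets = {"atlas": 0.5, "lhcb": 0.2, "cms": 0.3}
windows = [
    {"atlas": 60, "lhcb": 30, "cms": 10},  # most recent window
    {"atlas": 40, "lhcb": 10, "cms": 50},  # previous window
]

priority = fairshare_priority(targets, windows)
winner = max(priority, key=priority.get)
print(winner, priority)
```

With these numbers a single VO has the highest priority at this instant, which matches the remark that at any one point only one VO is winning. Long-running jobs change the usage figures slowly, which is one way a sawtooth pattern can arise.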
JT: Slide 15. For the NIKHEF issue, note there is no simple way to restart CVMFS after you have done an upgrade so you need to drain the system before.
Question – how to notify T2s of important upgrades?
ML: I don’t see that we have a much stronger means to behave better. We have GDB, MB, workshops etc… but issues quickly forgotten.
MS: There is nothing like a “Tier-2”. Often it is the smaller sites that don’t upgrade quickly. Many follow lcg-rollout.
JG: Could also try to use the NGI channels. Or perhaps use a dedicated broadcast and include LCG-ROLLOUT.

ATLAS operations report (Stephane Jezequel)
Dec 2010-Mar 2011: no specific activity.
JT: The EOS… what is the difference between this and normal xrootd from a user point of view?
SJ: There is no real difference but we are not using it with the xrootd part. It is not directly accessing the data.
JT: So you cannot say if it is better for T2s to use xrootd or EOS?
SJ: Discussion concluded that currently they (Massimo/CERN) want to get experience with it and not provide it to sites. There is no documentation.
ML: There will be more news later in the year – perhaps.
MS: People are worried about DPM support because a few people have been saying there is a problem. CERN have said that DPM will be supported.
JG: Why would you want to replace DPM?
JT: There is another xrootd project… Doug Benjamin… Any comments on that?
SJ: Yes I know about it, but it is more for T3s. It is not yet under ATLAS operations.
SJ: Slide on contributions - US sites making much bigger contribution to analysis than Europe. We need to understand this situation. I have seen this 1/3 for quite some time.
FC: It is because there is a lot of data there….
AH: But at GridKa we have 10% of share… Rod Walker investigated and he thought that users are moving to BNL because every dataset is there and it is a good working facility.
JT: On Storage news…. That’s interesting that CERN will migrate CASTOR to EOS if EOS is not yet production ready.
JG: It is the disk copies that are being migrated.
MS: And CASTOR remains the tape backend.
JT: Who did you talk to in order to get them to clean up dark data?
SJ: There is a twiki with a list of data to delete. The request was sent to the ATLAS clouds.
PG: And you tell the sites when you have finished?
SJ: Yes for T2s it goes via cloud support. I have direct contacts for T1s.
JG: On the PIC problem… what unique files were lost?
SJ: User files, and in December we decided to stop the MC replication.
GM: Of the ones recovered about 30% were in fact unique.

JT: You move from 10 points of failure to 1.
SJ: There is a replica at BNL.
You could have more replicas than just BNL.
ML: It is to reduce the operational issues – Oracle infrastructure – reduce dependencies. Replicas at more sites mean more expense and so on. LHCb are rethinking their model of distributed LFCs.
… the more copies you have the larger the problem of keeping them in-sync. ATLAS prefers single point for clearing up problems rather than continuously clearing up problems.
MS: But also this is more consistent since DDM database is already central.

Slide 14:
JG: So there is data exchange going via the T1s that is not at the T1? Like what?
SJ: User data for example.
GM: What is the status of the request (slide 15) to T1s to add new FTS channels?
SJ: Simone took it to the T1SC meeting 2 weeks ago and was gathering feedback.
GM: So the final request was not yet made?
SJ: Yes. Also issue with T2Ds changing.

Slide 18
JG: Most tape drives compress automatically… many will actually increase the file size when storing already zipped data.
SJ: To help fit raw copy on disk the compression needs to be done.
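The point about already-compressed data can be illustrated with a general-purpose compressor. This is a minimal sketch only: zlib stands in for whatever algorithm a tape drive applies, pseudo-random bytes stand in for already-zipped data, and the repeated string is a hypothetical example of redundant raw data.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the example is reproducible

# High-entropy bytes behave like data that has already been compressed.
incompressible = bytes(rng.getrandbits(8) for _ in range(100_000))
# Highly repetitive bytes behave like redundant, uncompressed data.
redundant = b"raw detector event " * 5_000

packed_incompressible = zlib.compress(incompressible)
packed_redundant = zlib.compress(redundant)

print("high-entropy input:", len(incompressible), "->", len(packed_incompressible))
print("redundant input:   ", len(redundant), "->", len(packed_redundant))
```

The high-entropy input comes out slightly larger than it went in because the compressor adds framing overhead without finding redundancy to remove, while the redundant input shrinks substantially.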
JG: Do you use CAF for prompt reco?
SJ: No for reprocessing.

CMS (Daniele Bonacorsi)
Not many major issues since last update.
Can rely on 6/7 T1s and 40/50 T2 sites at any one time.
SF: Slide 4 – site readiness. You say you have had a plateau for a few years. Why?
DB: The plateau cannot be much higher because sites … ready is at the end of a whole series of tests. In this computation you are always using a region where sites are either poorly connected or there is restructuring at the site. This is T2s ready at any given moment. … that is an average. Just because sites are “not ready” does not mean they do not get MC.
Yannick Patois: 60% efficiency on slide 11. We get a bottleneck on storage in Strasbourg. It seems we underestimate the amount of bandwidth between WN and SE. We would like to know what the requirements are.
DB: A valid point. For a specific T2 it could be relatively easy to find reasons, but here we provide the average and I am quite sure the majority of problems here are not due to bandwidth limitations. We are trying in operations to group sites and address these clusters with similar problems…
JG: What is the spread on that plot? The standard deviation?
DB: About 5 or 10%.
JG: If you have a spread then perhaps there is an optimum model for your T2s otherwise not.
JG: Slide 15 reminded me that sites can still install ARGUS before glexec tarball available.
Thank you to our hosts at Lyon.
Meeting closed 16:30.

EVO chat window:
[09:16:41] Claudio Grandi joined
[09:16:41] H323 Loc Amphi joined
[09:16:43] Daniele Bonacorsi joined
[09:16:45] Pete Gronbech joined
[09:16:45] Yannick Patois joined
[09:16:45] John Shade joined
[09:16:46] Tiziana Ferrari joined
[09:16:46] Pete Gronbech Can the camera move slightly left to get all the screen in?
[09:16:46] Peter Oettl joined
[09:16:46] Brian Davies joined
[09:16:47] Pierre Girard joined
[09:16:47] Denise Heagerty joined
[09:16:47] Loc Amphi joined
[09:16:48] Massimo Sgaravatto joined
[09:16:48] Matt Hodges joined
[09:16:48] dimitris zilaskos joined
[09:16:48] Stephen Burke joined
[09:17:24] RECORDING John joined
[09:18:42] Pete Gronbech Perfect
[09:18:43] Stephen Burke Are there microphones in the room? Audience comments are quite faint.
[09:19:34] Yannick Patois t's OK here
[09:20:43] Yannick Patois (but rised Gain in the interface)
[09:21:21] Pierre Girard Hi Stephen, there are only fixed microphones. If necessary we will ask people to speak louder.
[09:21:25] Jeremy Coles Hi Stephen - it tested fine earlier. There is no portable mic visible to me... only a desk mike for the speaker.
[09:21:43] Jeremy Coles Mike = mic
[09:28:43] Richard Gokieli joined
[09:30:20] Richard Gokieli left
[09:30:53] Richard Gokieli joined
[09:33:00] Richard Gokieli left
[09:37:34] Richard Gokieli joined
[09:40:16] Alessandra Forti joined
[09:46:04] Massimo Sgaravatto can't hear
[09:46:17] Phone Bridge joined
[09:46:51] Derek Ross joined
[10:01:50] Alessandra Forti left
[10:35:27] Stephen Burke The people talkking in the background are making Markus inaudible
[10:35:48] Pete Gronbech
[10:36:50] Jeremy Coles On the phone bridge?
[10:36:57] Stephen Burke looks like it
[10:40:09] Pete Gronbech can people mute if they are not speaking please
[10:40:53] Stephen Burke the phone bridge presuably can't see the chat window!
[10:41:34] Pete Gronbech yes
[10:41:44] Pete Gronbech No it' sothers
[10:41:57] Massimo Sgaravatto left
[10:49:32] RECORDING Daniele joined
[10:51:29] Peter Oettl left
[11:01:08] Brian Davies left
[11:04:13] Phone Bridge left
[11:09:25] Tiziana Ferrari left
[11:27:36] Daniele Bonacorsi left
[13:00:08] Matt Hodges left
[13:00:13] Massimo Sgaravatto joined
[13:00:20] Luc FLORENTZ joined
[13:00:24] Tiziana Ferrari joined
[13:02:12] Daniele Bonacorsi joined
[13:02:39] Luc FLORENTZ left
[13:04:21] Claudio Grandi left
[13:04:25] Claudio Grandi joined
[13:04:52] Daniele Bonacorsi left
[13:04:56] Daniele Bonacorsi joined
[13:05:55] Daniele Bonacorsi left
[13:07:32] Daniele Bonacorsi joined
[13:08:34] Daniele Bonacorsi left
[13:10:08] Daniele Bonacorsi joined

- 09:00 → 10:00
  
  Tour of Machine Room
- 10:00 → 12:00
  Morning
  - 10:00
    
    Welcome 15m
    
    Speaker: Pierre-Etienne Macchi
    
    Slides
  - 10:15
    
    Introduction 15m
    
    Speaker: Dr John Gordon (STFC-RAL)
    
    CVMFS
    
    Slides
  - 10:30
    CREAM Migration Status 20m
    
    ACE - Gridview comparison, availabilities based on CREAM 10m
    
    Speaker: Wojciech Lapka (CERN)
    
    Slides
    
    Status 10m
    
    Slides
  - 10:50
    
    glexec and MUPJ 20m
    
    Slides
  - 11:10
    
    LHCONE 30m
    
    Speaker: John Shade (CERN)
    
    Slides
  - 11:40
    
    File System Stress testing 20m
    
    Speaker: Xavier Canehan
    
    Slides
- 12:00 → 14:00
  
  Lunch 2h
- 14:00 → 16:30
  Experiment Operations
  
  Operational Issues from the Experiments
  - 14:00
    
    ALICE 30m
    
    Speaker: Maarten Litmaath (CERN)
    
    Slides
  - 14:30
    
    LHCb 30m
    
    Speaker: Stefan Roiser
    
    Slides
  - 15:00
    
    ATLAS 30m
    
    Speaker: Stephane Jezequel (Laboratoire d'Annecy-le-Vieux de Physique des Particules (LAPP))
    
    Slides
  - 15:30
    
    CMS 30m
    
    Speaker: Dr Daniele Bonacorsi (Univ. of Bologna, Italy)
    
    Slides