grid-operations-meeting@cern.ch Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the running of the production grid infrastructure, based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
GGUS representatives
VO representatives
VRVS "Sky" room will be available 15:30 until 18:00 CET
actionlist
minutes
28-R-15
1
Feedback on last meeting's minutes
Minutes
2
Grid-Operator-on-Duty handover
From France (backup: South East) to Italy (backup: Russia)
We will start moving to per-service upgrades.
First one, probably released today:
- CE: lcg-info-dynamic-scheduler fix for host/queue name matching
next ones in the queue:
- FTS
- DPM/LFC
- UI and WN
5
Change of format of operations meeting
1) EGEE Items
Grid-Operator-on-Duty handover
Any other items/announcements specific to EGEE (e.g. updates to middleware)
Issues coming from VO and ROC reports (ROC reports not received)
2) OSG Items
Issues coming from OSG
3) WLCG Items
Upcoming SC4 Activities
Any other general WLCG items
WLCG related Issues coming from experiment VOs and Tier-1/Tier-2 reports (VO reports + Tier 1 reports not received)
4) Review of action items
5) Feedback on last meeting's minutes
6) AOB
6
REMINDER: update to the 1.8 IGTF CA package on every service node, not only the WNs (the WNs are the only nodes checked by the SFTs)
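A minimal sketch of the version check a site admin might run on each node; the version strings below are stand-ins (on a real node the installed value would come from the CA package's rpm query), so treat this as an illustration, not the official procedure.

```shell
# Compare an installed CA package version against the required 1.8
# release. "installed" is a hypothetical stand-in for rpm output.
required="1.8"
installed="1.7"

# Version-sort the two strings; if the required version sorts first,
# the installed one is new enough.
oldest=$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n 1)
if [ "$oldest" = "$required" ]; then
    echo "CA package at $installed: up to date"
else
    echo "CA package at $installed: update to $required needed"
fi
```

With the example values above this reports that an update is needed.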
7
Bug 16625: 10-50 times speedup for lcg-info-generic.
Reports were not received from:
ROCs: UKI (holiday)
Tier-1s (reports attached): BNL
VOs:
CE ROC: Improvements to gLite update release process needed.
1.A) (The "4444 jobs" problem. The bug affects all sites with a special character in the domain name or in a queue name.)
gLite updates for production sites should not contain packages that are known to have bugs. The package lcg-info-dynamic-scheduler released with gLite 3.0.2 contained a well-known bug that affects CEs whose hostname contains the character '-' or whose queue names contain underscores, uppercase letters or numbers. This bug is not listed as a known issue on any download page (e.g. http://glite.web.cern.ch/glite/packages/R3.0/deployment/lcg-CE/3.0.3/lcg-CE-3.0.3-update.html).
Since the release date this issue has generated at least three tickets:
https://gus.fzk.de/pages/ticket_details.php?ticket=11681&from=allt
https://gus.fzk.de/pages/ticket_details.php?ticket=11619&from=allt
https://savannah.cern.ch/bugs/?func=detailitem&item_id=19233
There are many such CEs in the central BDII, so more tickets are likely. The worst part is that the problem was already reported in May:
https://savannah.cern.ch/bugs/?func=detailitem&item_id=17716
and the patch has been available since July:
https://savannah.cern.ch/patch/?func=detailitem&item_id=754
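To illustrate the class of bug described above: a name check that only accepts lowercase letters, digits and dots silently rejects valid CE host and queue names. The pattern below is a hedged sketch, not the actual lcg-info-dynamic-scheduler code.

```shell
# Hypothetical over-strict pattern: only lowercase letters, digits, dots.
pattern='^[a-z0-9.]+$'

good="ce1.example.org"
bad_host="ce-1.example.org"    # '-' in the hostname triggers the bug
bad_queue="Long_Queue2"        # underscore, uppercase letter, digit

echo "$good"      | grep -Eq "$pattern" && echo "accepted: $good"
echo "$bad_host"  | grep -Eq "$pattern" || echo "rejected: $bad_host"
echo "$bad_queue" | grep -Eq "$pattern" || echo "rejected: $bad_queue"
```

Both "rejected" lines fire for the hyphenated hostname and the mixed-case queue name, which is exactly the matching failure the tickets report.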
1.B) Updates should be coordinated with central services (GSTAT is affected here). For example, MyProxy's ServiceType changed from 'myproxy' to 'MyProxy' in YAIM 3.0.0-17 (released in June), and GSTAT still issues a warning on PROX nodes because it looks for 'myproxy' (https://gus.fzk.de/pages/ticket_details.php?ticket=11653&from=allt).
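A minimal sketch of the mismatch: the published value is taken from the report above, while the case-insensitive check is only one assumed way a monitor could accept both spellings.

```shell
# Sites now publish 'MyProxy'; GSTAT still compares against 'myproxy'.
published="MyProxy"

# Exact match, as the monitor does it today: fails for the new value.
[ "$published" = "myproxy" ] || echo "exact match fails"

# Case-insensitive match: accepts both old and new spellings.
echo "$published" | grep -qi '^myproxy$' && echo "case-insensitive match passes"
```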
2. DECH: Is there a way to clean up the RBs' MySQL databases of (very) old entries? The database files are already many GB in size. (DESY-HH)
3. LHCb: I'd simply like to put more pressure on the GridKA site admins, whose site is failing reconstruction jobs for the ongoing DC06.
There is a GGUS ticket (#11599) describing the problem, whose priority was set to severe.
Problems with lcg-gt at GRIDKA in DC06
Detailed description:
Dear Site Manager,
For several days now, when we (LHCb) try to run Reconstruction DC06 jobs at your site, on data we have just transferred there, we run into the following situation:
when the jobs issue lcg-gt commands to obtain the appropriate TURL for the dcap protocol to be used by the application, a large fraction of them time out (killed by our own wrapper after 30 seconds), and thus the input data for the jobs cannot be resolved.
This same logic has been working fine at your site in the past, and it is also working at the other Tier-1s (PIC, RAL, IN2P3) and at CERN.
Please investigate the problem and let us know if we can help you to debug the issue.
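The 30-second wrapper mentioned in the ticket could look roughly like the sketch below; this is an illustration of the timeout logic, not LHCb's actual wrapper, and the SURL in the usage comment is hypothetical.

```shell
# Run a command in the background and kill it if it exceeds the given
# number of seconds; the exit status reflects whether it finished.
run_with_timeout() {
    secs=$1; shift
    "$@" &
    cmd_pid=$!
    # Watchdog: after $secs seconds, kill the command if still running.
    ( sleep "$secs"; kill "$cmd_pid" 2>/dev/null ) &
    watchdog_pid=$!
    wait "$cmd_pid"
    rc=$?
    kill "$watchdog_pid" 2>/dev/null
    return $rc
}

# Usage against the storage element (hypothetical SURL):
# run_with_timeout 30 lcg-gt srm://<se-host>/<path-to-file> dcap
```

A non-zero return here corresponds to the "timed out by our own wrapper" case in the report, where the TURL is never obtained.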