28-R-15 (CERN conferencing service (joining details below))
email@example.com Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768
NB: Reports were not received in advance of the meeting from:
ROCs: AP, Italy
VOs: Atlas, Alice, BioMed, CMS, LHCb
Feedback on last meeting's minutes
<big> Grid-Operator-on-Duty handover </big>
From: CE / CERN
To: SWE / Italy
Report from CERN COD:
Site: ITPA-LCG2 was failing GSTAT. It is publishing ScientificSL 5.0, which is not in the OS list used by GSTAT.
In such cases it should be the responsibility of the site/ROC to send a request to the mailing list: firstname.lastname@example.org
to add the required OS version to the list.
*Please note rotation calendar for this week:*
Lead team: SouthWesternEurope
Backup team: Italy
PPS reports were not received from these ROCs:
AP, CE, IT, RU
Issues from EGEE ROCs:
None found. In fact, the pages displayed were completely empty, so I suspect a malfunction of the portal (reported).
gLite3.0.2 PPS Update47 was released to PPS today and is now in the pre-deployment testing phase. The update contains, among others:
FTA update (PATCH:1740). gridFTP session handling was changed: now copy and getFileSize are done in the same session: fix for BUG:33528
gLite3.1.0 PPS Update22 was released to PPS last Friday and it is now in the pre-deployment testing phase.
The release contains, among others, an update of yaim-core, so, technically, all services are concerned. The full list of patches deployed is:
1219 fix for DENY tags to lcg-info-dynamic-scheduler
1645 R3.1/SLC4/x86_64: GFAL/lcg_util update
1663 lcg-infosites (patch 1646 revisited)
1680 R3.1/SLC4/x86_64: GFAL 1.10.8
1709 [ YAIM ] yaim core and yaim lcg-ce 4.0.4 series
1728 [ YAIM ] glite-yaim-clients 4.0.3 series
1730 new lcg-ManageVOTag version (solving bug 34245)
1738 R3.1/SLC4/i386: GFAL & lcg-util update
1712 R-GMA fix for forwards compatibility
gLite3.1.0 PPS Update23 is in preparation:
The release will introduce the WMS and LB services for SL4
<big> EGEE issues coming from ROC reports </big>
(ROC France): Some site administrators complained because their e-mail addresses were added to a VO mailing list without their agreement. The VO has been contacted and the problem is being solved, but the incident raises the more general problem of SPAM generated by the project itself. Could we agree on a good administration rule for mailing lists? At the least, except for some obvious and mandatory mailing lists, an actor should be able to unregister from any mailing list by him/herself. The way to unregister should be made clear by the mailing list.
(ROC DECH): Please reopen action item 150. The problem is still present. See GGUS #33850.
(ROC SEE): https://gus.fzk.de/ws/ticket_info.php?ticket=33697 is long overdue,
please put some pressure on the corresponding support unit to respond.
(ROC SEE): LCG-TAU still has some problems, so it is now in downtime for the next 7 days in order to upgrade to the latest gLite release.
(ROC SWE): We would like to have an update on Top-BDII failover awareness in the gLite client tools. Is it possible to configure several BDIIs in the form of a list with yaim?
(ROC UKI): GGUS should respond whether the UKI-SOUTHGRID-CAM-HEP problem of 100 mails for the same ticket is a bug.
(ROC UKI): There have been many complaints in UKI about the new requirement to complete the site reports every day. Site admins often fill out the report for the week in one go and this seems a sensible approach; at least they should be able to choose. Several sites have indicated that they will stop filling out the reports in this new format. On the positive side, the new interface seems better, with the graphical representation of downtime etc. However, it would be very welcome if the colours used between tools were consistent. Previously grey represented downtime and red a failure... now we have black. Sites would also like to see the past history for the report so they can cross-reference previous failures, which is a feature lost in this upgrade.
(ROC UKI): The move to validating every use of a certificate on a site is becoming tedious. Is this a feature of the browser settings or does everyone get greeted with constant requests to use their certificate? Is it possible to have a compact view and a detailed view of site problems? I cannot see correlations between sites anymore.
<big> gLite Release News</big>
gLite 3.1 Update18, announced for last week, is being released to production right now.
We apologise for the delay, due to a technical issue in the preparation of the release.
The update contains:
NEW: glite-MON for SL4
fix for bug #33769: incorrect pool free space after dpm-drain
improved ACL management for srmMkdir command
lcg-tags no longer produces Globus warnings
voms-admin client 2.0.6-1 providing ACL support on command line
vdt_globus_essentials (affecting several services and notably the CE)
bug fix to prevent globus-job-manager processes from piling up on a CE (bug observed at CERN after SAM WMS/RB tests were enabled)
voms-admin server (VOMS)
Refactored voms-admin-ping script
ACL management web service (compatible with client >= 2.0.6-1)
Registration web service.
many bug fixes
<big> WLCG issues coming from ROC reports </big>
None this week.
<big>WLCG Service Interventions (with dates / times where known) </big>
There will be a test of the Tier0 to Tier1 Optical Private Networks backup links from 15.00 to 19.00 CEST (13.00 to 17.00 UTC) on Wednesday 9 April.
The plan of the test is here:
RAL will be unreachable for 15-20 minutes between 16:45 and 17:15 CEST
PIC will be unreachable for 15-20 minutes between 17:15 and 17:45 CEST
The goal of the maintenance is to verify that all the backup solutions work as expected. The T1s with a backup link should be up all the time, but at the moment we cannot guarantee that it will be the case and there may be outages at any time for any Tier1. It would be appreciated if experiments could particularly exercise their links as much as possible during the test period.
Harry Renshall / Jamie Shiers
<big> Alice report </big>
No report received before the meeting.
<big> Atlas report </big>
new version of DPM:
In today's meeting I would like to know the status of the DPM server
version fixing the ACL problem. In particular, I would like to know if
this has been released to production and when (it should have been
last Wednesday). Also, I would like to know the exact version+patch
number so that I can refer to it in the proper manner.
ATLAS T2s need the patch to start running production on SRMv2 and we
would like to push the deployment of the patch ASAP.
Could you make sure someone can provide the information mentioned above at
today's meeting? Both Alessandro and I will be present.
ANSWER: The version is DPM 1.6.7-4; it is in PPS and will be released to production today
ATLAS sites with lcg-utils for SRM2:
We have developed a SAM test to see which version of lcg-utils has been installed on the WNs of the ATLAS supporting sites.
The results can be seen on the SAM web page by selecting ATLAS VO, CE, CE-sft-lcg-version SAM link.
The sites that give ERROR in this test have not upgraded to the SRM2-compatible version of lcg-utils.
We hope this will help in following up the action of having the WNs upgraded to SRM2 at all the ATLAS supporting sites.
<big> CMS report </big>
No report received before the meeting.
<big> LHCb report </big>
No report received before the meeting.
(OSG - Indiana University)
Discussion of open tickets for OSG
The only outstanding ticket is: