28-R-15 (CERN conferencing service (joining details below))
email@example.com Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure, based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768
NB: Reports were not received in advance of the meeting from:
Feedback on last meeting's minutes
<big> Grid-Operator-on-Duty handover </big>
From: SEE / AP
To: France / NDGF
Report from SEE COD:
Please note the following about these sites:
GGUS Ticket #34393
No update, problem still there.
GGUS Ticket #26634
SAM tests are not stable,
the problem is still there, no updates
to the ticket.
The new COD dashboard interface seems to be better.
Report from AP COD:
On 4/23-24 the CIC Alarms interface was not working.
On 4/25 GOCDB was not accessible, and this also affected the COD portal and COD
<big> PPS Report & Issues </big>
PPS reports were not received from these ROCs:
AP IT NE SEE
Issues from EGEE ROCs:
ROC France: First attempt to deploy the tarball version of WN gLite 3.1 x86_64. Except for one missing package, it seems OK.
ROC UKI: Due to problems with the hardware on which the UKI-SOUTHGRID-BHAM-PPS site is installed, the site will no longer maintain a PPS site. Comment (PPS coordination): As the site was involved in pre-deployment testing, this information has to be forwarded to the test coordinator (Mario David).
<big> gLite Release News</big>
Now in production
gLite 3.1.0 Updates 20 and 21 were released to production with HIGH priority.
Update 21 was an urgent fix for a compatibility issue, introduced by Update 20, affecting lcg-CEs still running at version 3.0.
The main changes introduced by Update20 (relevant for CCRC08) are:
new feature: glite-data-gfal version (1.10.11-1)
provides new functions gfal_abortrequest and gfal_abortfilesseveral,
new feature: glite-data-dm-util (lcg_util) version (1.6.11-1) now
prints the SE type (SRMv1, SRMv2, Classic SE) in verbose mode (when relevant)
bug fix: lcg-ls does not work for the classic SE
bug fix: lcg-cr glibc memory corruption
bug fix: gfal_stat seg. fault with dummy LFN
bug fix: lcg-sd doesn't work with SRMv2 request token
bug fix: lcg-gt segmentation fault
fix globus-cass-cache problem on WN
fix problem of replication of a zero-length file
improve logging of updatefilestatus method
DICOM back-end service for DPM
producing re-buildable source RPMs
group writable directories when SRM started with umask 0
DPM-DSI: DPM's gridftp does not allow for ':' in SURL (GGUS ticket #32335)
support for CKSM (md5 only so far)
Changes in Globus jobmanager and GASS cache. These modifications
improve the performance of the lcg-CE by a factor of two to three
Details in http://glite.web.cern.ch/glite/packages/R3.1/updates.asp
Now in pre-production
PPS sites are now upgrading to gLite 3.1.0 PPS Updates 25 and 26:
dynamic service publisher, replacing the previous static configuration
major dCache version change, adding support for SRM 2.2
new VOMS core 1.8.3-4 (affecting VOMS servers and clients on UI, WN, VOBOX, CE, SE_dpm, LFC, WMS, LB)
CE, DECH ROCs: Admins are complaining about production updates that are not checked thoroughly enough. It is much better to invest some more effort of one tester in testing than the effort of hundreds of site administrators in debugging problems. They complain that this time it was not even possible to run YAIM on the CE. Reply (Release managers and pre-production teams): We apologise for the disruption caused. Update 20 was accelerated due to requirements coming from CCRC08. The installation issue with the CE was due to a mistake in the release preparation: the dependency of the installation function on the new version of yaim-core was not set correctly. As the correct version of yaim-core was already deployed in pre-production (but not in production), this issue was not visible to the pre-deployment testers in PPS. This particular issue could only have been trapped by a deployment test in production (currently not foreseen by the release procedure). Note that yaim-core was being held back in PPS because it forced a change in the permissions schema for site-info.def and its containing directory to be implemented at all sites, which was not rated acceptable for operations.
The issue found later in production, affecting submission from CE 3.0 to WN 3.1, has another explanation. The CE at version 3.1 has been in production for more than two months, which means that regression tests are not being done in certification. Pre-production runs, by mandate, the top version of the services.
From: 22-04-2008 (Tue) 07:45 UTC
To: 23-04-2008 (Wed) 13:30 UTC
Affected services: all
Symptoms: problems/fixes propagated to SAM possibly 1 hour later than normal (tests only in every odd hour)
Reason: upgrade of SAM UI (SLC4, gLite 3.1)
Solution: sorting out problems arising during the installation + testing
top-BDII config generator tool
From: 23-04-2008 (Wed) 16:15 UTC
To: 23-04-2008 (Wed) 21:15 UTC
Symptom: OSG sites alternately present and absent
Reason: misconfiguration of the top-BDII config generator
Solution: configuration fixed
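For reference, the duration of each of the two interventions above can be worked out from their From/To timestamps. The following is only a minimal sketch: the helper name `outage_duration` and the variable names are illustrative and not part of any SAM or BDII tool, and the weekday labels are dropped from the timestamps before parsing.

```python
from datetime import datetime, timedelta

# Day-month-year format used in the From/To lines (weekday label removed).
FMT = "%d-%m-%Y %H:%M"


def outage_duration(start: str, end: str) -> timedelta:
    """Return the time elapsed between two UTC timestamps given in FMT."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Timestamps copied from the two intervention reports.
sam_ui = outage_duration("22-04-2008 07:45", "23-04-2008 13:30")
top_bdii = outage_duration("23-04-2008 16:15", "23-04-2008 21:15")

print(sam_ui)    # 1 day, 5:45:00
print(top_bdii)  # 5:00:00
```

So the SAM UI upgrade spanned 29 hours 45 minutes and the top-BDII misconfiguration lasted 5 hours.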
The 1.6.7-4 and 1.6.10 DPM releases were not found for gLite 3.0. Are they only available for gLite 3.1? Answer (gLite Release team): DPM is no longer supported on 3.0. Be aware that this means that no regression testing is currently being done for services in this version.
CSCS did not upgrade their dCache installation on Friday as originally scheduled. They expected a minor update as suggested by the release numbering. But because it turned out that the configuration and the installation scripts had changed, they decided not to take the risk of breaking their installation, which is running on Solaris machines, on a Friday afternoon. They encourage dCache developers to use proper numbering semantics, which would help distinguish between minor and major updates.
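To illustrate the kind of numbering semantics CSCS is asking for, the sketch below classifies an update by comparing dotted version strings. It assumes a purely numeric major.minor.patch scheme, which is not necessarily how dCache numbers its releases; the function names and the version numbers used are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Split a dotted numeric version such as '1.8.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def update_kind(installed: str, candidate: str) -> str:
    """Classify an upgrade as 'major', 'minor', 'patch', or 'none'.

    Assumes the (hypothetical) convention that the first field changes for
    incompatible releases, the second for new features, and the rest for fixes.
    """
    old, new = parse_version(installed), parse_version(candidate)
    if new <= old:
        return "none"   # not an upgrade
    if new[0] != old[0]:
        return "major"  # configuration or installation scripts may change
    if new[1:2] != old[1:2]:
        return "minor"
    return "patch"


# Hypothetical version numbers, for illustration only.
print(update_kind("1.8.0", "1.8.1"))  # patch
print(update_kind("1.8.0", "1.9.0"))  # minor
print(update_kind("1.8.0", "2.0.0"))  # major
```

Under such a convention, a site could decide from the version number alone whether an update is safe to apply late in the week.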
Could we get official information on the requirements for T2s to participate in the coming CCRC? (There have been unspecific complaints about a lack of reactivity on the side of T2s. We are not aware of any such problems with sites in our region, but would like to encourage VOs to let us know if there are any such concerns.)
Daily edited availability comments appear without dates in the weekly availability site report. Would it be possible for the CIC team to add timestamps to this part of the ROC report? This would help us to prepare the ROC summary.
SWE ROC: There was a request to know how many sites need information on configuring the SGE batch system for short deadline jobs. Are these short deadline jobs obligatory? Which VOs request this feature?
UKI ROC: (Feedback to be passed to the CIC portal team from one site: Table layout of weekly report is far from ideal. It is very easy to mix a detail field belonging to a failure with the comment box belonging to the previous or next failure).
<big> WLCG issues coming from ROC reports </big>
AP ROC: Site decommissioning
GOG-Singapore would like to decommission their site by June 2, 2008
They support the following VOs: Alice, Atlas, CMS, LHCb
Please migrate what is still needed by your VO before the site is disabled
<big>WLCG Service Interventions (with dates / times where known) </big>
The Classic SEs at IN2P3-LPC are planned to be removed from production on 15 May.
Please back up your data before that date.
The old Edinburgh site, ce.epcc.ed.ac.uk, will be retired from use in one week's time (1 May 2008). Storage services, via srm.epcc.ed.ac.uk, will be accessible via the new Edinburgh site, ce.glite.ecdf.ed.ac.uk, for some time after this, although the intention is to slowly migrate to newer storage.
This means that support for several VOs will be dropped by Edinburgh, as they are not part of UKI-SCOTGRID-ECDF's supported VO list. In particular, these VOs are:
alice, babar, biomed, cdf, cms, dzero, esr, fusion, geant4, hone, magic, minos, na48, planck, sixt, t2k and zeus
At the start of May, the site egee.man.poznan.pl will be removed from production and shut down. Please back up your data stored on storage elements belonging to this site.