28-R-15 (CERN conferencing service (joining details below))
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up, and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768
SA3 put forward a proposal for a centralized distribution mechanism for the gLite clients (WN).
Several responses have been received so far and are attached here.
<big> NEW: Broadcasting of downtimes of Operations Tools (GOC DB, CIC portal, etc.) </big>
<big> WLCG issues coming from ROC reports </big>
[SWE ROC]: CMS opened a ticket to the site LIP-Coimbra reporting that the disk space for CMS is full. Would it not be better to assign this kind of ticket to the VO rather than to the site, assuming the site fulfils the capacities agreed in a MoU or similar?
<big> End points for FTM service at tier-1 sites </big>
<big>FTS SL4 - required by the experiments or tier-1 sites?</big>
ALICE: Neutral (as long as there is no disruption to the service).
ATLAS: Prefer not to, to avoid introducing problems this close to data taking.
CMS: Priority is stability during data-taking days. Whatever is scheduled in advance *and* allows some pre-testing can be negotiated, though. For the CERN migration, the PhEDEx /Prod vs /Debug instances can be used to allow testing before going into production (discussed with Gavin).
LHCb: Neutral (as long as there is no disruption to the service).
ASGC: ???
BNL: Need to migrate (has a fairly pressing need to move to SL/RHEL4 because of the site security situation; would definitely switch over if it is made available in production soon).
FNAL: Need to migrate (hardware is ageing fast; there may be issues with maintenance).
FZK: Prefer to wait (to include patch for SRM1 requests issued by FTM)
IN2P3: Can wait until next shutdown.
INFN: ???
NDGF: Prefer to wait until next shutdown.
PIC: ???
RAL: ???
SARA/Nikhef: ???
TRIUMF: Can wait until next shutdown.
<big>WLCG Service Interventions (with dates / times where known) </big>
Global Run data taking with the magnet at 3T over some part of the weekend.
CERN-IT and T0 workflows:
Migration of data transferred into the local CAF-DBS instance for public information and access became slow due to an issue debugged over the weekend and now understood. About 11k blocks remain to go, which may take up to 3 days to digest. No action is needed, just let it run: insertion of CAF-urgent datasets can be (and has already been successfully) forced manually, causing no trouble for CERN-local analysis access.
Distributed sites issues:
T1_ES_PIC failures in CMS-specific SAM analysis test (missing input dataset: already fixed, thanks to Pepe Flix)
T1_DE_FZK failures in CMS-specific SAM analysis test (missing input dataset)
T2_CH_CSCS: No JobRobot jobs assigned (BDII OK?) + CMS-specific js and jsprod tests fail ("no compatible resources")
T2_US_NEBRASKA: No JobRobot jobs assigned (BDII OK?)
T2_UK_London_Brunel: Aborted JobRobot jobs ("Job got an error while in the CondorG queue")
T2_US_Wisconsin: No JobRobot jobs assigned (BDII OK?)
T2_ES_CIEMAT: CMS-specific SAM errors in analysis and js tests (timeout executing tests)
T2_PT_LIP_Coimbra: CMS-specific SAM CE errors in jsprod + dCache "No space left on device" (acknowledged)
T2_US_MIT: CMS-specific SAM Frontier error ("Error ping from t2bat0080.cmsaf.mit.edu to squid.cmsaf.mit.edu": the latter is down)
T2_US_Wisconsin: CMS-specific SAM tests not running since 8/29 (problems with the BDII? JobRobot is not running either)
<big> LHCb report </big>
<big> Storage services: Recommended base versions </big>
ATLAS asks sites to set up USER and GROUP space tokens, setting
specific ACLs to protect access to those areas. Furthermore, they have
asked for specific ACLs on the directories used to access those spaces.
The ATLAS request can currently be fulfilled by neither DPM nor
dCache installations. Sites are therefore asked to just set up the space
tokens, allowing generic ATLAS users access to both files and directories.
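As an interim illustration, a site could publish such tokens with generic ATLAS access roughly as follows. This is a minimal sketch for a DPM site only: the token names, sizes, and paths are assumptions rather than ATLAS-mandated values, and the exact ACL syntax may vary with the DPM version.

```shell
# Reserve two space tokens for the ATLAS VO (sizes and lifetimes are illustrative).
dpm-reservespace --gspace 5T --lifetime Inf --group atlas --token_desc ATLASUSERDISK
dpm-reservespace --gspace 5T --lifetime Inf --group atlas --token_desc ATLASGROUPDISK

# Create the namespace directory and open it to the generic ATLAS group,
# since per-space ACLs are not yet supported (hypothetical example path).
dpns-mkdir -p /dpm/example.org/home/atlas/atlasuserdisk
dpns-chgrp atlas /dpm/example.org/home/atlas/atlasuserdisk
# Grant group access plus a matching default ACL for newly created entries.
dpns-setacl -m g:atlas:rwx,d:g:atlas:rwx /dpm/example.org/home/atlas/atlasuserdisk
```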
The next release of dCache (1.8.0-16) will support ACLs on
directories. This will allow site administrators to set up correctly what
ATLAS has requested.
For DPM, the release that allows setting multiple ACLs on spaces is
still in the hands of the developers.
The dCache 1.8.0-15p12 release, expected this week, contains a fix
for the ATLAS Tier-1s: the pin specified on BringOnline will start
after the file has been brought to disk, not at the time the request
was issued, as before. After this patch release, the dCache team will
concentrate on release 1.8.0-16, so no more patch releases to
1.8.0-15 will be made available unless very critical bugs are reported.