28-R-15 (CERN conferencing service (joining details below))
firstname.lastname@example.org
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0148141
OR click HERE (Please specify your name & affiliation in the web-interface)
Details of tickets reaching final step of escalation:
ROC_North, ITPA-LCG2, GGUS:42015, Nothing since 16th October.
Report from David Groep:
JP-HIROSHIMA-WLCG: id#9165 - GGUS Ticket #41683
No response whatsoever from this site.
SDU-LCG2: id#9164 - GGUS Ticket #41680
No response from the site for 30 days. The site has been failing throughout with the very same "File not available.Cannot read JobWrapper output, both from Condor and from Maradona." error.
Perhaps escalation will trigger a response. Expiry set to 3/NOV.
This week saw a lot of SE downtime affecting the associated CEs, especially at ELTE-HU, where the SE iSCSI interface is broken.
R-GMA at ELTE should be up according to the follow-up mail exchange, but is actually still down.
The StoRM front end at INFN-T1 remains unstable. The issue is acknowledged, but the associated errors keep appearing.
<big> PPS Report & Issues </big>
As of last Monday, the VOMS pilot service is installed
with the VOMS release from PATCH:2390; VOMS proxies are available from it. All PPS sites
are invited to re-configure their UIs to use this pilot service.
ROC Italy were the only ROC not to have submitted a report by the 14:00 deadline.
ROC Germany/Switzerland: BDII problems
The region experienced problems with the new top-level BDII release: some queries return no output. This problem did not occur with older versions. Are other sites also affected?
For example, the WMS shows entries like:
DATUM -I: [Info] fetch_bdii_ce_info(ldap-utils.cpp:567): zeus: skipped due to empty ACBR.
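The behaviour in the log line above is that the WMS skips CE entries whose GlueCEAccessControlBaseRule (ACBR) attribute comes back empty from the BDII. A minimal sketch of that check (the sample entries and hostnames are invented for illustration; a real query would go to the top-level BDII over LDAP):

```python
# Sketch: mimic the WMS check that skips CE entries with an empty ACBR list.
# Entry data is invented; in production it would come from an LDAP query
# against the top-level BDII (Glue 1.x schema).

def usable_ces(entries):
    """Split CE ids into those with a non-empty ACBR list and those skipped."""
    usable, skipped = [], []
    for ce_id, acbr in entries:
        (usable if acbr else skipped).append(ce_id)
    return usable, skipped

sample = [
    ("ce01.example.org:2119/jobmanager-pbs-atlas", ["VO:atlas"]),
    ("zeus.example.org:2119/jobmanager-pbs-dteam", []),  # empty ACBR -> skipped
]

ok, skipped = usable_ces(sample)
print("usable:", ok)
print("skipped due to empty ACBR:", skipped)
```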
ROC SWE: SRM failures explained:
PIC supplied details concerning one hour of SAM failures on 30-Oct-2008.
ATLAS were running jobs at PIC which were reading several files via SRM, using
lcg-cp (up to 14k srmget requests/hour). This generated a high load on the SRM, which then did not serve the SAM tests quickly enough.
Solution: the ATLAS contact person has asked to change the local access protocol for
reading from lcg-cp to dcap (dccp). Until the change is made, however, the problem
could recur. As a medium/long-term solution they are considering an SRM server upgrade (x64 plus more RAM for Catalina), and possibly splitting the service over several nodes.
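For scale, the reported request rate works out to roughly 4 requests per second on top of the SAM test load. A quick sketch of the arithmetic plus a purely illustrative client-side throttle (a generic token bucket; this is not what PIC or ATLAS actually deployed, and the 2 req/s cap is a made-up figure):

```python
import time

# 14k srmget requests per hour, as reported, averages about 3.9 requests/second
# arriving at the SRM front end.
RATE_PER_HOUR = 14_000
print(f"average rate: {RATE_PER_HOUR / 3600:.1f} req/s")

class TokenBucket:
    """Illustrative client-side throttle: allow at most `rate` calls/second."""
    def __init__(self, rate, burst):
        self.rate, self.capacity = rate, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2.0, burst=5)  # hypothetical cap of 2 srmget/s
allowed = sum(bucket.allow() for _ in range(20))
print(f"{allowed} of 20 immediate requests allowed")
```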
<big> WLCG issues coming from ROC reports </big>
AP ROC: No specific issue, but it might interest ATLAS to know that TAIWAN-LCG2 is currently working on a couple of problems:
Source File Preparation Problem from TAIWAN-LCG2 Storage Element (ATLASMCDISK Space Token).
File transfer problem at TAIWAN-LCG2_MCDISK in ASGC Cloud
ROC France: ATLAS pilot jobs at CCIN2P3
For several months now, ATLAS has been submitting a huge number of pilot jobs even when there are no tasks to process. Despite the French ATLAS production team having been notified, and despite attempts at manual regulation of pilot job submission, 25% of ATLAS pilot jobs still do nothing once running.
Could ATLAS Production please adapt its execution engine to regulate pilot job submission automatically according to the number of tasks in its central queue?
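The regulation being requested could in principle be as simple as capping pilot submission by the central queue depth. A minimal sketch of the idea (function and parameter names are invented; this is not the actual ATLAS submission code, which would also have to weigh site limits, fair shares and failure rates):

```python
def pilots_to_submit(tasks_queued, pilots_idle, pilots_running, max_burst=50):
    """Submit only as many new pilots as there are unmatched tasks.

    Illustrative policy only: demand is the number of queued tasks not already
    covered by an existing pilot, capped at a per-cycle burst limit.
    """
    demand = tasks_queued - (pilots_idle + pilots_running)
    return max(0, min(demand, max_burst))

# With an empty central queue, no new pilots are submitted:
print(pilots_to_submit(tasks_queued=0, pilots_idle=10, pilots_running=5))     # 0
# With a large backlog, submission is capped at max_burst:
print(pilots_to_submit(tasks_queued=500, pilots_idle=20, pilots_running=30))  # 50
```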
<big>WLCG Service Interventions (with dates / times where known) </big>