28-R-15 (CERN conferencing service; joining details below)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE and WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
GGUS representatives
VO representatives
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768
d) EGEE issues coming from ROC reports
ROC France: the CERN Trusted Certification Authority published an old CRL twice during the week. This was due to a CERN CA web site update.
Such a problem is quite tricky to diagnose, as failures show up everywhere (SAM tests fail, some users complain, ATLAS DDM production goes wrong); to find the cause you have to cross-check all of those data. So, perhaps:
It would be interesting to monitor the CRL validity of the official CAs, and to keep a history of it (a sketch of such a check follows below).
Ask the CA administrator to check the CRL validity after each update.
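As a rough sketch of the first suggestion, a check like the following could be run periodically (e.g. from cron) against each official CA's CRL distribution point, appending its one-line output to a log file to keep the history. This assumes Python with the cryptography package; the URL is a hypothetical placeholder, not a real distribution point.

```python
import datetime
import urllib.request

from cryptography import x509

# Hypothetical placeholder; substitute the CA's real CRL distribution point.
CRL_URL = "http://ca.example.org/ca.crl"

def check_crl(url: str) -> bool:
    """Fetch a DER-encoded CRL and return True if it is still valid."""
    der = urllib.request.urlopen(url, timeout=30).read()
    crl = x509.load_der_x509_crl(der)
    now = datetime.datetime.utcnow()
    next_update = crl.next_update  # may be None if the CRL omits nextUpdate
    # One line per run; appending these to a file keeps the validity history.
    print(f"{now.isoformat()} lastUpdate={crl.last_update} "
          f"nextUpdate={next_update}")
    return next_update is not None and next_update > now

if __name__ == "__main__":
    if not check_crl(CRL_URL):
        print("WARNING: published CRL is stale; expect widespread failures")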
Reminder: when a SAM critical test is failing, it is obviously good to solve the problem quickly, but please don't forget to announce the problem as soon as possible, to prevent people from wasting time on misleading alarms.
Several of our sites had site availability problems this week following the CA root certificate update. Things have now returned to normal, but we are carrying out a post mortem to understand where things could have worked better.
3. WLCG Items
a) WLCG issues coming from ROC reports
No items this week.
b) WLCG Service Interventions (with dates / times where known)
GOG-Singapore would like to decommission their site by June 2, 2008. The hardware and services at the site will be shut down permanently. Please migrate any data that is still needed by your VO before the site is disabled (see the replication sketch after these notices).
The site currently supports the following VOs: alice, atlas, lhcb, cms, biomed, dteam and ops.
CYFRONET-IA64: We are going to shut down CYFRONET-IA64 completely at the end of May 2008. Please take care of any data you may have on our classic SE: ares03.cyf-kr.edu.pl.
INFN-FIRENZE: The classic SE grid002.fi.infn.it is planned to be removed from production on 15 June. Please back up your data before that date.
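As an illustration of the data migration requested in these notices, here is a minimal sketch driving the lcg-utils command-line tools of that era from Python. The VO, LFNs and destination SE are hypothetical placeholders, and the exact lcg-rep/lcg-del options should be verified against the local man pages before use.

```python
import subprocess

# Hypothetical placeholders: adjust the VO, catalogue entries (LFNs),
# the SE being drained and the destination SE for your own situation.
VO = "dteam"
OLD_SE = "ares03.cyf-kr.edu.pl"        # classic SE being decommissioned
NEW_SE = "se.example-t2.example.org"   # replacement storage element
LFNS = [
    "lfn:/grid/dteam/example/file1",
    "lfn:/grid/dteam/example/file2",
]

for lfn in LFNS:
    # Replicate the file to the new SE (registering the new replica) ...
    subprocess.run(["lcg-rep", "--vo", VO, "-d", NEW_SE, lfn], check=True)
    # ... and only then remove the replica on the SE being shut down.
    subprocess.run(["lcg-del", "--vo", VO, "-s", OLD_SE, lfn], check=True)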
One of the main focuses of last week was T1 workflows. The first was stable, with some Castor issues (mostly GC-related) addressed and fixed; on the second, we had reprocessing and skimming jobs running at the T1 sites:
ASGC: pretty impressive performance; no problems found.
CNAF: also OK, running up to ~600 skim jobs in parallel.
FNAL: has been running all kinds of processing jobs for days, up to 3.4K in parallel.
FZK: running processing, with some issues reading the input RAW data; skim jobs also somewhat slower than at the other T1s.
IN2P3: clearing out its backlog; more processing jobs will come.
PIC: much processing still to go; the total number of slots used is driven by the number of running skim jobs; some issues there are being investigated with input/help from the other T1s.
RAL: using all queues; some jobs were killed by some CEs for exceeding their maximum CPU time, so jobs are now restricted to the long queue, and things got better.
It was an interesting week for transfers, especially in T0-T1 together with the other VOs, and in T1-T2. Data transfer tests were stable on the T0-T1 routes and will continue until the end of the challenge. T1-T2 tests are going on with link rotation; details at https://twiki.cern.ch/twiki/bin/view/CMS/CCRC08-DataTransfers . Tests are ramping down on the T1-T1 routes.
At the T0, repacker tests are in progress. More info will be collected during this week.