28-R-15 (CERN) or via the conferencing service (joining details below)
Weekly OSG, EGEE and WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure on the basis of weekly reports from the attendees. Reported issues are discussed, assigned to the relevant teams, followed up, and escalated when needed. The meeting is also the forum where sites get a summary of the weekly WLCG activities and plans.
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0148141
OR click HERE (please specify your name and affiliation in the web interface)
ROC France: INFORMATION IN2P3-CC: The central LFC for the Biomed VO is currently overloaded due to growth in Biomed activity. Although the hardware was upgraded as an emergency measure on Friday, the problem persists.
The problem might be due to some limitations in the number of simultaneous connections between the LFC and the Oracle DB. We will contact LFC support to find a good (and scalable) solution.
Sorry for the inconvenience.
ROC UK/I: A Biomed user's activity caused site instabilities by repeatedly transferring the same 2.8 GB file from a single UK site SE to WNs across EGEE. After being ticketed, the user produced more replicas, but there is concern about this data distribution model and the bandwidth stress it causes. For a related GGUS ticket see: https://gus.fzk.de/ws/ticket_info.php?ticket=43489. The user responded quickly. We may be seeing signs of the limits of the standard submission approach/model: "We are submitting these jobs with the native EGEE command glite-wms-job-submit. These grid jobs then access the 2.8 GB data file through the command lcg-cp. So we decided neither where the jobs are scheduled nor which file replica is used by these jobs. The EGEE middleware decides." Because of the I/O limitations, the Biomed jobs are often quite inefficient.
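The submission pattern quoted above can be illustrated with a minimal JDL sketch. This is illustrative only: the script name and LFN are invented, and only glite-wms-job-submit and lcg-cp are taken from the report.

```
// job.jdl -- illustrative sketch; script name and LFN are hypothetical.
Executable         = "run_analysis.sh";
StdOutput          = "std.out";
StdError           = "std.err";
InputSandbox       = {"run_analysis.sh"};
OutputSandbox      = {"std.out", "std.err"};
// Declaring the input file lets the WMS rank CEs close to a replica:
InputData          = {"lfn:/grid/biomed/data/input.dat"};
DataAccessProtocol = {"gsiftp"};
```

The job would be submitted with `glite-wms-job-submit -a job.jdl`, and inside the job script the file would be fetched with something like `lcg-cp --vo biomed lfn:/grid/biomed/data/input.dat file:$PWD/input.dat`. As the user notes, both the CE and the replica are then chosen by the middleware, not by the user.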
ROC UK/I: UKI-NORTHGRID-LANCS-HEP saw a problem with a recent WN update: GGUS 43473. The ticket seems to bounce around without anybody really knowing how to help! The point to note is that this is probably a site problem, but the site/ROC has struggled to diagnose it because it (apparently) requires middleware expert help. The site will try a reinstall with 64-bit gLite to remove the 64/32-bit incompatibilities, but the underlying problem is still not understood.
ROC UK/I: Site availability does not take SRMv2 systems into account. As a result the overall RAL availability depends on a dCache service which is no longer considered a front-line service. SRMv2 not being included in the overall availability figures is a problem with the monitoring, not with the site. Update: the WLCG Management Board decided on Tuesday to use SRMv2 in the availability calculations as of December (in lieu of the SRMv1 tests). This will be discussed with the EGEE ROC Managers, who will be asked to ratify it.
ROC UK/I: On the topic of SAM, has there been any progress on centrally identifying common problems seen in SAM? On 19th November from 18:00-21:00 UK time a number of sites saw the same (top-level BDII?) problem. It would save much time if such errors could be automatically flagged as possibly being due to an off-site problem.
Plots for Biomed activity
<big> Java Bouncy Castle problems </big>
Extract from broadcast: A few days ago jpackage updated bouncycastle to version 1.41. This version causes problems for several gLite node types because it places the jars in a new directory. The gLite developers are currently working on patches to solve this issue. For the time being, please make sure that your site DOES NOT UPGRADE to bouncycastle 1.41.
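One way a site could follow this advice (a sketch, assuming a yum-based installation pulling from the jpackage repository) is to exclude the package from updates:

```
# /etc/yum.conf (excerpt) -- illustrative; holds back the jpackage update
# until patched gLite packages are available.
[main]
exclude=bouncycastle*
```

A one-off `yum update --exclude='bouncycastle*'` achieves the same for a single update run; remove the exclusion once the gLite patches are released.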
Node types affected by this problem:
<big> WLCG issues coming from ROC reports </big>
Nothing this week.
<big>WLCG Service Interventions (with dates / times where known) </big>