grid-operations-meeting@cern.ch Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
GGUS representatives
VO representatives
The VRVS "Twister" room will be available from 15:30 until 18:00 CET
16:00 → 17:25 WLCG-OSG-EGEE Operations Meeting, 28-R-15
16:00
Feedback on last meeting's minutes (5m)
16:05
EGEE Items (20m)
Grid-Operator-on-Duty handover (5m)
From GermanySwitzerland ROC (backup: CERN) to SouthEast Europe ROC (backup: France ROC)
Tickets:
OPENED 24
1ST MAIL 23
2ND MAIL 16
QUARANTINE 13
SITE OK 44
UNSOLVABLE 1
Notes:
Quite a few sites are still failing replica tests with timeouts (see also the last handover report). Possible reasons are a configured top-BDII that is too slow, network problems, or both.
SFT results have not been updated on lcg-sft.cern.ch since last Friday; new results can now be found at lcg-sam.cern.ch.
Site specifics:
- NSC-BLUESMOKE "couldn't get the glite-ce to work" (job list match fails). No further information about the status. The lcg-ce is OK; perhaps take the glite-ce out of monitoring until it is fixed.
- USCMS-FNAL-WC1 asked for a fairly long delay on its tickets (rm problems and ops VO).
EGEE issues coming from ROC reports (15m)
All ROC reports were received.
What are the plans to migrate the gLite software to SLC 4? We would be happy to take advantage of the newer kernel's performance and its support for modern hardware. (CentralEurope ROC)
A major problem with downtimes: GRIF is a site split across several subsites. When a downtime is declared for a single CE/node in GRIF, the whole site is affected, so no jobs run on the other GRIF CEs, which is not what one would expect from declaring a downtime on that one node only. The only way for GRIF to update nodes without affecting all the other subsites seems to be not to declare the downtime at all, which is not desirable either. Could the tools (GOC DB, SAM-CE and others) be updated so that a partial site downtime is not treated as a global downtime? (France ROC)
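The difference between the reported behaviour and the requested one can be sketched as follows. This is only an illustration under stated assumptions: the `Downtime` record, both function names and the host names are hypothetical, and nothing here reflects the actual GOC DB or SAM data model.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Downtime:
    """Hypothetical downtime record: one declaration for one node of a site."""
    site: str
    node: str
    start: datetime
    end: datetime


def site_in_downtime(downtimes, site, when):
    """Reported behaviour: a downtime on any node marks the whole site."""
    return any(
        d.site == site and d.start <= when < d.end
        for d in downtimes
    )


def node_in_downtime(downtimes, site, node, when):
    """Requested behaviour: only the node named in the declaration is affected."""
    return any(
        d.site == site and d.node == node and d.start <= when < d.end
        for d in downtimes
    )


if __name__ == "__main__":
    # Hypothetical host names and times, for illustration only.
    declared = [
        Downtime("GRIF", "ce-a.example.org",
                 datetime(2006, 1, 1, 8), datetime(2006, 1, 1, 12)),
    ]
    now = datetime(2006, 1, 1, 10)
    print(site_in_downtime(declared, "GRIF", now))
    print(node_in_downtime(declared, "GRIF", "ce-b.example.org", now))
```

With a per-node check, a CE that has no declaration of its own would stay eligible for jobs and tests while another subsite is being updated.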
SFT/SAM: SFTs are no longer running regularly on https://lcg-sft.cern.ch, and there are additionally some site-specific SFT framework problems (checked for region DECH, see GGUS Ticket #12454). The SAM framework is not yet an adequate substitute for the old SFT framework, in that it does not yet completely represent the production environment. One reason for this might be failures at the sites, but there are also several obstacles due to the migration. The transition phase with both test frameworks should last long enough for sites to get used to the new SAM framework. (DCG ROC)
Maintaining persistent MW services: There is no recipe yet for shutting down and bringing up persistent middleware services such as RB and SE. We suggest that the deployment group come up with a concept for maintaining these services in a controlled way, with as little effect as possible on users' jobs. Such a maintenance procedure is an essential part of any middleware component in a production environment. (DECH ROC)
Last week the INFN-GRID Release Team discovered a change in the URLs of the LCG-gLite metapackage lists. The previous URL was
http://glite.web.cern.ch/glite/packages/R3.0/$VERSION/doc/rpm_list/
and the current one is
http://glite.web.cern.ch/glite/packages/R3.0/deployment/${ge}/${rpm_version}/${ge}-${rpm_version}.rpm.list.txt
We ask whether it is possible to be informed of such changes, which concern release internals and development more than users. Do other regions need this too? (Italy ROC)
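For scripts that fetch these lists, the two URL layouts above can be expanded as in the following minimal sketch. The templates are copied from the report; the function names are hypothetical, the values used in the demo are placeholders, and treating `ge` as the metapackage name is an assumption rather than something the report states.

```python
from string import Template

# URL layouts as quoted in the report; $VERSION, ${ge} and ${rpm_version}
# are the placeholders used there.
OLD_LIST_URL = Template(
    "http://glite.web.cern.ch/glite/packages/R3.0/$VERSION/doc/rpm_list/"
)
NEW_LIST_URL = Template(
    "http://glite.web.cern.ch/glite/packages/R3.0/deployment/"
    "${ge}/${rpm_version}/${ge}-${rpm_version}.rpm.list.txt"
)


def old_list_url(version):
    """Expand the previous layout (hypothetical helper)."""
    return OLD_LIST_URL.substitute(VERSION=version)


def new_list_url(ge, rpm_version):
    """Expand the current layout (hypothetical helper).

    Assumption: `ge` is the metapackage name and `rpm_version` its RPM version.
    """
    return NEW_LIST_URL.substitute(ge=ge, rpm_version=rpm_version)


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    print(old_list_url("3.0.0"))
    print(new_list_url("glite-UI", "3.0.0-1"))
```

Keeping both templates in one place means a future layout change only needs to be applied once in the release tooling.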
Port 8443 conflict between DPM and R-GMA: the INFN-GRID Release Team has locally modified yaim to change the DPM port. If needed, they can send the modifications to the "Mother-Release" for inclusion. (Italy ROC)
Can the RC report expiry time be extended to include at least some of the weekend if not all? (UK/I ROC)
16:25
OSG Items (5m)
FNAL would like to investigate adding open source databases to FTS. In order for us to properly evaluate this, we want to check out the FTS client and server code from the CVS repository. In which CVS repository is this code stored?
Piotr updated us on the SFT testing status. In the next week we hope to run SFTs on our development site.
16:30
WLCG Items (35m)
WLCG SC report and upcoming activities (15m)