28-R-15 (CERN conferencing service (joining details below))
Maite Barroso Lopez
email@example.com
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up, and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0157610
NB: Reports were not received in advance of the meeting from:
ROCs: NorthernEurope, Russia
Tier-1 sites: BNL
VOs: Alice, LHCb
list of actions
Feedback on last meeting's minutes (5m)
<big> Grid-Operator-on-Duty handover </big>
From ROC DECH (backup: ROC France) to ROC Italy (backup: ROC Taiwan)
NB: Please can the grid ops-on-duty teams submit their reports no later than 12:00 UTC (14:00 Swiss local time).
DECH: These three sites received a "final reminder" that the COD requests information about the progress on pending issues:
If there is no reaction next week, the COD team should escalate the tickets to the operations meeting.
France: 4 sites are requested to attend the Weekly Operations Meeting:
- (SU-GRID) https://gus.fzk.de/pages/ticket_details.php?ticket=24254
No entries in the ticket except from COD people.
No entries in the ticket except from COD people for more than one month.
No entries in the ticket except from COD people for at least four weeks.
Monitoring of the node has been disabled for some time, but with no explanation given and no downtime declared.
No entries in the ticket except from COD people for two weeks.
More generally, ROCs and sites must agree on a duty rota. Many people are on vacation, but the ROCs must still ensure a minimum presence. Most tickets have remained in the same state for two weeks.
France: There are also still synchronization problems between the GOC DB and SAM:
<big> PPS Report & Issues </big>
PPS reports were not received from these ROCs:
New updates will be announced to PPS tomorrow, including a patch for FTS 2.0 (to be applied only by Tier-1s) and a new version of the gfal client.
Issues from EGEE ROCs:
<big> Phase-out of LCG-2_7_0 </big>
There are still a few production sites publishing LCG-2_7_0 as their release version.
After more than one year of running gLite releases, it is time to phase LCG-2_7_0 out and officially stop its support.
We would therefore like to request all remaining sites to upgrade in the coming month, with a deadline of the end of August.
The list of sites still publishing LCG-2_7_0 is available from Gstat. There are 14 sites as of this morning.
Min has kindly provided the details of how this is extracted, so you can better understand the results shown:
GStat downloads all the GlueHostApplicationSoftwareRunTimeEnvironment
attributes of all the GlueSubClusterUniqueID entries present for the entire site.
It then tries to find the newest glite or lcg release.
This method provides only one version result per site.
This can make it difficult to find clusters within a site that have not upgraded:
the information is aggregated for each site, so some nodes do not show up.
Sites like RAL, with several CEs, will therefore have only one version result from GStat, even though each CE runs a different version.
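The aggregation described above can be sketched as follows. This is an illustrative reconstruction, not GStat's actual code: the helper name, the input data, and the version-comparison rule are assumptions; only the Glue attribute names come from Min's description.

```python
# Hedged sketch of GStat's per-site version aggregation as described above.
# Input: per-subcluster GlueHostApplicationSoftwareRunTimeEnvironment tags.

def newest_site_version(subcluster_tags):
    """Return the single newest GLITE/LCG release tag across all
    subclusters of a site, or None if no release tag is published."""
    releases = []
    for tags in subcluster_tags.values():
        for tag in tags:
            if tag.startswith(("GLITE-", "LCG-")):
                # Normalise e.g. "GLITE-3_0_2" -> (3, 0, 2) for comparison
                # (a simplification: it compares GLITE and LCG numbers directly)
                version = tuple(int(x) for x in tag.split("-", 1)[1].split("_"))
                releases.append((version, tag))
    # The aggregation happens here: one result per site, so an old cluster
    # is hidden behind a newer one -- the problem described above.
    return max(releases)[1] if releases else None

# A site with several CEs/subclusters reports only the newest tag:
site = {
    "ce1.example.org": ["GLITE-3_0_2", "VO-atlas"],
    "ce2.example.org": ["LCG-2_7_0"],   # stale cluster, invisible in the result
}
print(newest_site_version(site))  # prints GLITE-3_0_2
```

This is why the LCG-2_7_0 list from GStat can miss an outdated cluster at a site whose other clusters already run gLite.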
<big> Migration to SL4 WNs </big>
AP ROC, ASGC T1: SLC4 migration of the WNs is in progress. 200 nodes have been upgraded and another 300 remain.
DECH ROC: SL version 4 (2 sites)
SWE ROC, PIC site: The most important issue concerning PIC this week is the migration to SLC4. We have installed a new CE from scratch which points to a PBS queue with WNs running SLC4. This CE is at the moment configured to support ops and dteam. We have also opened access to the sgm users of the VOs CMS, ATLAS and LHCb so that they can test the software installation. Deciding what to include in the software directory for every VO has been tedious, as different VOs have different needs. In the end we decided to define a brand-new software directory and neither publish any old tag from the other production CEs nor copy old software. From the ATLAS point of view this approach is acceptable: they have already installed the latest version of their software on the test CE and it seems to work fine. From the CMS point of view, however, this is not efficient at all. We are still trying to reach a solution that is acceptable for all the VOs, or at least for LHCb, CMS and ATLAS.
<big> EGEE issues coming from ROC reports </big>
UKI ROC: RAL have hit the limit of the hardware currently running the RGMA registry; unfortunately, moving it to new hardware will also require a change of IP address. We realise that this may require sites to change firewall rules. How much notice would be necessary to allow sites to prepare for this change?
<big> Tier 1 reports </big>
<big> WLCG issues coming from ROC reports </big>
Germany-Switzerland: Tier-1 report: we propose reverting to a common reporting template; a definition of severity levels is also needed.
PIC had an issue last week with newly installed WNs that were missing a library apparently needed by the ATLAS software: /usr/lib/libg2c.a. Installing the rpm gcc-g77 solved the issue, but to avoid such issues we believe it would be very useful for each VO to express its "base installation requirements" in some standard way. For instance, a meta-rpm like "atlas-requirements" that sets an rpm requirement on gcc-g77 would have been convenient. The same holds for the other VOs.
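As a sketch of the meta-rpm idea above: such a package would carry no files of its own, only Requires tags. The package name, version, and layout here are hypothetical; only the gcc-g77 dependency comes from the report.

```spec
# Hypothetical "atlas-requirements" meta-package spec (illustrative only).
Name:      atlas-requirements
Version:   1.0
Release:   1
Summary:   Base OS packages required on WNs by the ATLAS software
License:   Free
BuildArch: noarch

# gcc-g77 provides /usr/lib/libg2c.a, the library missing on the new PIC WNs
Requires:  gcc-g77

%description
Meta-package expressing the ATLAS base installation requirements;
installing it pulls in the needed system rpms.

%files
```

Installing the meta-rpm on a new WN would then pull in gcc-g77 (and any other listed dependencies) automatically via the package manager.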
<big>WLCG Service Interventions (with dates / times where known) </big>
The various client tools (FTS, GFAL, lcg_utils) have been enhanced to support SRM v2.2. During certification and testing, some bugs have been found. More details on the schedule and the feature list will be provided at future operations meetings. See also last week's LCG ECM.
Regarding OS publication (discussed at the last meeting, http://indico.cern.ch/conferenceDisplay.py?confId=18760, and documented at http://goc.grid.sinica.edu.tw/gocwiki/How_to_publish_the_OS_name), we would like a clarification of the differences between the 32-bit and 64-bit publication.
(CERN / NIKHEF)
Job processing: The focus of the week was on 1) 'Spring07' tails clean-up, ongoing; 2) termination of the first half of the CSA07 requests (60.5 Mevts, of which 59.2 merged; production performance: an average production rate of ~42 Mevts/month (steady), <job-slots usage> of 4600 (+43% compared to Spring07 GEN-SIM), with regular values >5000 and a best in 24h of ~6500); 3) assignment of the second half of the CSA07 requests (~51 Mevts will follow). In MC production, the merging steps encountered problems, with a massive loss of produced (unmerged) events which apparently cannot be recovered at CERN due to Castor issues; this is being followed up.
Data Transfers: "production" transfers continue, i.e. GEN-SIM data shipping to T1s. "Test" transfers: the LoadTest infrastructure is converging into the Debugging Data Transfers (DDT) programme. The DDT Task Force gave its first report at the last Integration/CSA07 meeting; they identified the first set of already-commissioned links, and they are reviewing the 'LoadTest sample population' procedure to increase the number of T2->T1 links that can actually be tested within the programme.