WLCG-OSG-EGEE Operations meeting
Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))
Maite Barroso Lopez (CERN)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
GGUS representatives
VO representatives
ROCs:
Tier-1 sites: INFN
VOs: Atlas
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0157610
NB: Reports were not received in advance of the meeting from:
2. EGEE Items
a) Grid-Operator-on-Duty handover: from ROC Russia (backup: ROC UK/I) to ROC Italy (backup: ROC France)
Tickets:
- Last week's problem is still present; some sites show the same failures:
Time to Match History : http://goc02.grid-support.ac.uk/cgi-bin/rb.py?RB=lapp-rb01.in2p3.fr
Publication Date (UTC) : Wed, 18 Apr 2007 09:35:01 +0000
/opt/edg/bin/edg-job-submit output :
JobID : None
Selected Virtual Organisation name (from --config-vo option): ops
**** Error: API_NATIVE_ERROR ****
Error while calling the "NSClient::multi" native api IOException: Unable to connect to remote (lapp-rb01.in2p3.fr:7772)
**** Error: UI_NO_NS_CONTACT ****
Unable to contact any Network Server
-------------------------------------------------
Time to Match History : http://goc02.grid-support.ac.uk/cgi-bin/rb.py?RB=rb-fzk.gridka.de
Publication Date (UTC) : Wed, 18 Apr 2007 13:35:07 +0000
/opt/edg/bin/edg-job-submit output :
JobID : None
Selected Virtual Organisation name (from --config-vo option): ops
Connecting to host rb-fzk.gridka.de, port 7772
Logging to host rb-fzk.gridka.de, port 9002
**** Error: API_NATIVE_ERROR ****
Error while calling the "edg_wll_RegisterJobSync" native api Unable to Register the Job:
https://rb-fzk.gridka.de:9000/bWPNXoGJ9qvNzOfjRk7o5w
to the LB logger at: rb-fzk.gridka.de:9002
Connection refused (edg_wll_ssl_connect())
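For reference, a rough sketch of how such a submission can be reproduced by hand from a UI, assuming a trivial test JDL; the --config-vo path below is only an example and depends on the local UI installation (the config-vo file is what points the submission at a particular RB):

# Create a trivial test JDL (contents are illustrative)
cat > hello.jdl <<'EOF'
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
EOF

# Submit for the ops VO; the VO config file names the Network Server (RB) to contact
/opt/edg/bin/edg-job-submit --vo ops \
  --config-vo /opt/edg/etc/ops/edg_wl_ui.conf hello.jdl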
- GStat for some sites (RWTH-Aachen, BEgrid-ULB-VUB, GR-04-FORTH-ICS, CSCS-LCG2, UNI-FREIBURG) shows the following warning:
Service Entry Check: warn
Service with incorrect versions found:
ID: httpg://grid-srm.physik.rwth-aachen.de:8443/srm/managerv1
Type: srm_v1
Vers: 1.1.1
Service with bad SRM service type found
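To cross-check what a site actually publishes, one can query the site BDII directly; the hostname and mds-vo-name below are examples (the site BDII normally listens on port 2170):

# List the GlueService entries published by the site, with their type and version
ldapsearch -x -LLL -H ldap://site-bdii.example.de:2170 \
  -b "mds-vo-name=RWTH-Aachen,o=grid" \
  '(objectClass=GlueService)' \
  GlueServiceUniqueID GlueServiceType GlueServiceVersion GlueServiceEndpoint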
- A hard-disk failure on pps-wms.cern.ch (aka lxb2092.cern.ch), the WMS of the CERN_PPS site, is causing problems at a number of PPS sites.
b) PPS reports: reports were not received from these ROCs: Italy, Asia Pacific
- PPS-Update 27 released to the PPS. This contains:
- patch #1118 lcg-vomscerts-4.4.1 has correct cert for biomed/egeode
- patch #1115 New version of lcg-info with support for VOViews, sites and services
- patch #1110 Dcache 1.7.0-34 upgrade with GridFTP bug fixes
- patch #1108 glite-yaim 3.0.1-12 5 => This version of YAIM enables DGAS logging on the LCG CEs.
- Significant issue found in the SL4 natively compiled WN (gridFTP ls causes a segmentation fault)
- A meeting with all PPS sites (VRVS or phone conference) is being scheduled. The tentative date is Thursday 03 May 2007, from 15:00 to 16:30. The preliminary agenda is available at http://indico.cern.ch/conferenceDisplay.py?confId=15191
- SRM-2 testing in PPS:
Although SRM-2 is not certified yet, experiments are requesting the PPS to support their initial testing of SRM-2 capabilities.
- New sites joining and why: a number of new sites already running the new SRM-2 in the context of the SRM-2 "pilot" are going to join the PPS. This will force a re-organization of the data management in PPS (e.g. end-points to be published and updated in FTS).
Sites willing to volunteer for a pre-installation of their SEs with SRM-2 are welcome.
Sites will also be asked to volunteer to declare SRM-2 SEs as their "Close SE".
- HEP VO specific testing:
The SRM-2 testing concerns, for the time being, only the HEP VOs.
Sites mainly dedicated to serving non-HEP VOs (e.g. Biomed, Diligent), although welcome to join the exercise, may find it preferable to stay out in order to avoid conflicts; in that case they would need to stop supporting the HEP VOs.
- Installation of 'uncertified' software:
There is no guarantee, so far, that SRM-2 will be certified before this test activity starts.
As usual, we will not ask sites to install uncertified software.
However, sites willing to do so in this case, provided it is compatible with any other current use of the PPS storage resources, are welcome.
- Data in the catalogs to be modified: conflicts?
The migration of the catalogs is not reversible; the migration scripts are meant for use in production.
Experiments should be aware that data created in the PPS catalogs during the exercise are going to be "lost" afterwards.
We have to check whether there is any showstopper to the migration of the existing catalogs in PPS.
- Configure CEs with SRM-2 SEs as 'Close SE'. Volunteers? As long as the list of end-points is not available, we ask here only for an expression of interest (one way to inspect the bindings currently published by a site is sketched at the end of this item).
- Issues coming from the ROCs
- A roadmap for gLite MW in general would help us plan ahead [SEE ROC]
Speaker: Nicholas Thackray (CERN)
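On the 'Close SE' question above, a sketch of how a site can inspect the CE-to-SE bindings it currently publishes; the hostname and site name are placeholders:

# Show the published GlueCESEBindGroup entries (which SEs each CE declares as close)
ldapsearch -x -LLL -H ldap://site-bdii.example.org:2170 \
  -b "mds-vo-name=MY-SITE,o=grid" \
  '(objectClass=GlueCESEBindGroup)' \
  GlueCESEBindGroupCEUniqueID GlueCESEBindGroupSEUniqueID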
c) Decision needed on moving forward with SL4 WN: there is a bug in the native SL4 WN, but the upgrade path from the 'interim' WN (SL3 made compatible with SL4) is very difficult. Given the circumstances, how should we move forward?
RECOMMENDATION: Make the interim WN available to production sites, with a clear message regarding the upgrade problems and also the timeline for the native SL4 WN, and leave it up to each site to decide how to handle it.
Speaker: Nicholas Thackray (CERN)
d) EGEE issues coming from ROC reports
- (ROC CERN, TRIUMF): SAM still handles timezones incorrectly. Maintenance on Fri 20th was scheduled for 14:00 - 16:00 UTC, but SAM showed the maintenance incorrectly at 08:04 UTC and an error at 14:02 UTC, i.e. wrongly during our maintenance.
- (ROC CERN, FNAL): 1. We set up a 2nd lcg gateway for redundancy. But if either goes down, SAM flags us as being down, thereby defeating the purpose of the 2nd gateway. Of course we are still operational, only SAM is marking us incorrectly. How can this be improved? I was told CERN runs multiple gateways, how do they handle this? 2. We need to split the cmswnNNN accounts on the 2 gateways since they operate independently.
- (ROC France): Within the relocatable distribution of WN/UI, the check_crl script is not relocatable (GGUS #20970). ANSWER: ticket submitted 19/04 and assigned to the Installation and Configuration/New Release support unit. A bit more patience before we escalate it.
- (ROC France): Please note that the number of spurious SAM test failures is decreasing. Congratulations.
- (ROC France): Announcement: a regional top-level BDII has been put into production. For now it is used only by the T1 to check the load; afterwards it will be proposed to all French sites.
- (ROC DECH): Announcement: there is a planned outage of site UNI-FREIBURG on the 2nd of May, from 05:00 UTC until 11:00 UTC.
- (ROC SEE): The latest update 21 to gLite introduced a new version of YAIM. It has some new features, which is very positive, but during deployment we ran into significant problems due to the introduction of special pool accounts for prd and sgm users (previously just one account for each of these) and new groups for them. Although advertised at the very end of the new YAIM guide on the twiki, this has profound effects: if these new accounts are introduced, this must be done on all nodes; otherwise people mapped to one of these accounts will have problems trying to access local resources on other nodes where such accounts do not exist.
Specifically: the release notes stated that reconfiguration is needed just for lcg-CE, lcg-CE_torque and glite-CE, but in fact you need to introduce the new accounts on all WNs at the same time. This is the list of GGUS tickets we created so far:
https://gus.fzk.de/pages/ticket_details.php?ticket=20941
https://gus.fzk.de/pages/ticket_details.php?ticket=20942
https://gus.fzk.de/pages/ticket_details.php?ticket=21044
To conclude, I would say that the release notes once again failed to mention some important things, and that this can badly affect VOs that make heavy use of prd or sgm accounts.
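As a quick way to spot the kind of inconsistency described above, a site can check that the new prd/sgm pool accounts resolve on the CE and on every WN. The account names, node list and ssh access below are assumptions; take the real names from your users.conf and your fabric management tools:

# Check that the new sgm/prd pool accounts are defined on all nodes
ACCOUNTS="atlassgm01 atlasprd01"            # example names, see users.conf
NODES="ce01.example.org wn001.example.org"  # example node list

for node in $NODES; do
  for acct in $ACCOUNTS; do
    if ! ssh "$node" getent passwd "$acct" >/dev/null 2>&1; then
      echo "MISSING: account $acct is not defined on $node"
    fi
  done
done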
3. WLCG Items
b) LHCb service: new problem with dCache SEs.
The problem was first discovered when trying to transfer DSTs from disk-only storage (d1t0) at our Tier-1s to CERN. It was observed that many files from SARA and PIC were terminally failing with the error:
"Transfer failed. ERROR the server sent an error response: 425 425 Can't open data connection. timed out() failed."
Regardless of the number of retries attempted, the files failed with the same error. When checking the files on the SRM (using srm-get-metadata; a minimal check along these lines is sketched at the end of this item), the SRM showed that these files were not staged, i.e. isCached = false.
This was clearly a problem, and the relevant files were given to the sites for further investigation. At PIC and SARA these files were confirmed to reside in '/pnfs' (and as such visible to the SRM) but not on a disk pool. Since these files were not backed up, they are considered lost.
Further investigation is ongoing at PIC, where a '/pnfs' to disk pool consistency check is being performed (although this operation is extremely heavy and the scripts have now been running for more than a week).
These files were d1t0.
Over this weekend further files were discovered at GRIDKA, SARA and RAL (provided to the sites this morning) which look to be suffering from the same problem (i.e. are registered in '/pnfs' but can't be brought online).
This time the files are supposed to be in backed-up storage (d0t1), so in principle they should be recoverable. But after three days of attempting to stage these files over the weekend (using LHCb's central stager service), they have not become available. Attempts to stage these files the hard way (i.e. attempting to copy them out) have also failed.
It is possible that the same problem affecting the disk-only files also affected these files in the disk cache before they were migrated to tape.
With both d1t0 and d0t1 files apparently affected by this problem, it is hard to assess (on the experiment side) which files are affected by this bug.
Speaker: Dr Roberto Santinelli (CERN/IT/GD)
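A minimal sketch of the kind of check described above, using srm-get-metadata from the dCache/srmcp client tools; the SURL is a placeholder for one of the affected files, and the exact output format varies between client versions:

# Ask the SRM for the metadata of a suspect replica and look at its staging state
SURL="srm://<dcache-srm-host>:8443/<pnfs-path-of-file>"
srm-get-metadata "$SURL" | egrep -i 'isCached|isPinned|size'
# isCached = false for a d1t0 file means the replica is not on any disk pool,
# i.e. the file is effectively unavailable, as reported above.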
c) WLCG issues coming from ROC reports
d) Upcoming WLCG Service Interventions (with dates / times where known). Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board.
- Downtime announcement: the RAL-LCG2 CASTOR will be down for upgrade on the 30th of April and the 1st of May; this will affect the 6 SRMs ralsrm[a-f].rl.ac.uk.
Time at WLCG T0 and T1 sites.
e) FTS service review
Please read the attached report.
- FTS report index - status by site and by VO
- Transfer goals - status by site and VO
- Transfer Operations Wiki
Speaker: Gavin McCance (CERN)
f) ATLAS service: see also https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071 and https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperations for more information.
Speaker: Kors Bos (CERN / NIKHEF)
g) CMS service
-- General: last week there was a CMS Offline/Computing workshop, which attracted most of the attention.
-- Job processing: MC production in progress. Nothing to report, apart from some left-overs of transfers to CERN still to be finished (mostly site problems, not FTS problems).
-- Data transfers: PhEDEx was off due to the DBS-1 -> DBS-2 migration. Last week was week 5 of Cycle 2 of the CMS LoadTest07 (*), and it was a suspension week. Activity will restart as soon as PhEDEx is back up (updated plan: Monday); the focus will be on T1<->T2 regional and non-regional routes. Planning for PhEDEx/FTS 2.0 is in progress.
[*]http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm
Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
h) ALICE service: nothing special to report for ALICE; attending just in case sites need or require anything from ALICE.
Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
i) WLCG Service Coordination Issues
Multi-VO Tier0-Tier1 transfer tests. The results of the previous tests (week of March 26th) show good overall daily / weekly transfer rates for both ALICE and CMS.
These have to be repeated including (at least) ATLAS, whose rates are significantly higher (~1 GB/s out of CERN to all Tier-1s, not including the current increased event sizes).
The earliest that such a combined test can be organised is ~end May - more details will follow as they are established.
Speaker: Jamie Shiers / Harry Renshall
4. OSG Items
6. AOB