WLCG Tier1 Service Coordination Minutes - 03 November 2011
Attendance
Action list review
Release update
Data Management & Other Tier1 Service Issues
Site |
Status |
Recent changes |
Planned changes |
CERN |
CASTOR 2.1.11-8 for all main instances; SRM 2.10-2 (2.11 on PPS); xrootd 2.1.11-1; FTS: 5 nodes in SLC5 (3.7.0-3), 7 nodes in SLC4 (3.2.1); EOS 0.1.0/xrootd 3.0.4 |
CERNT3 disk space (ATLAS and CMS) is being moved to EOS (same use case). This instance will be closed soon (users have already been moved out). As announced, the transparent upgrade of the name server nodes to the latest version has been performed (no functional improvements, just clean-up, better logging, etc.). Tests of the Tape Gateway component (optimised tape handling with improved performance and stability for tape reads/writes) are continuing on internal instances. |
More tests of the Tape Gateway are foreseen (presently only on "internal" instances). This will not go into production for the LHC experiments before 2012. |
ASGC |
CASTOR 2.1.11-5 SRM 2.11-0 DPM 1.8.0-1 |
1/11 2300 UTC - 2/11 1100 UTC: scheduled downtime for DC power construction which affected storage; extended as unscheduled till 3/11 1200 UTC to complete work |
7/11 1200 UTC - 8/11 1500 UTC: CASTOR upgrade |
BNL |
dCache 1.9.5-23 (PNFS, Postgres 9) |
None |
Nov 8-9 Upgrade to version 1.9.12-10, change from PNFS to Chimera |
CNAF |
StoRM 1.7.0 (ATLAS), StoRM 1.5.0 (other endpoints) |
|
StoRM upgrade to 1.7.0 for CMS and LHCb will be scheduled in the next 2 weeks |
FNAL |
dCache 1.9.5-23 (PNFS); httpd 1.9.5-25; Scalla xrootd 2.9.1/1.4.2-4; Oracle Lustre 1.8.3 |
|
|
IN2P3 |
dCache 1.9.5-29 (Chimera) on core servers and pool nodes |
RAM added to the Chimera server (now 32 GB) on October 25th |
|
KIT |
dCache (admin nodes): 1.9.5-27 (ATLAS, Chimera), 1.9.5-26 (CMS, Chimera) 1.9.5-26 (LHCb, PNFS) dCache (pool nodes): 1.9.5-6 through 1.9.5-27 |
|
|
NDGF |
dCache 1.9.14 (Chimera) on core servers. Mix of 1.9.13 and 2.0.0 on pool nodes. |
|
|
NL-T1 |
dCache 1.9.12-10 (Chimera) (SARA), DPM 1.7.3 (NIKHEF) |
|
|
PIC |
dCache 1.9.12-10 (last upgrade to patch release on 13-Sep); PNFS on Postgres 9.0 |
|
|
RAL |
CASTOR 2.1.10-1 2.1.10-0 (tape servers) SRM 2.10-2 |
None |
Change to the CASTOR info provider to correct the wrong total tape capacity |
TRIUMF |
dCache 1.9.5-28 with Chimera namespace |
None |
None |
Other site news
CASTOR news
CERN operations and development
EOS news
xrootd news
dCache news
StoRM news
FTS news
- FTS 2.2.5 in gLite Staged Rollout: http://glite.cern.ch/staged_rollout
- FTS 2.2.6 released in EMI-1 Update 6 on Sep 1
- restart/partial resume of failed transfers
- FTS 2.2.7 cancelled, next version will be FTS 2.2.8
- FTS 2.2.8 is being installed on the CERN pilot service. Once validated it will be opened for experiment testing.
DPM news
LFC news
- LFC 1.8.2-2 in staged rollout
LFC deployment
Site |
Version |
OS, n-bit |
Backend |
Upgrade plans |
ASGC |
1.8.0-1 |
SLC5 64-bit |
Oracle |
None |
BNL |
1.8.0-1 |
SL5, 64-bit |
Oracle |
None |
CERN |
1.8.2-0 |
SLC5 64-bit |
Oracle |
Upgrade to SLC5 64-bit only pending for lfcshared1/2 |
CNAF |
1.8.0-1 |
SL5 64-bit |
Oracle |
|
FNAL |
N/A |
|
|
Not deployed at Fermilab |
IN2P3 |
1.8.2-0 |
SL5 64-bit |
Oracle 11g |
|
KIT |
1.7.4-7 |
SL5 64-bit |
Oracle |
Oracle backend migration pending |
NDGF |
1.7.4.7-1 |
Ubuntu 10.04 64-bit |
MySQL |
None |
NL-T1 |
1.7.4-7 |
CentOS5 64-bit |
Oracle |
|
PIC |
1.7.4-7 |
SL5 64-bit |
Oracle |
|
RAL |
1.7.4-7 |
SL5 64-bit |
Oracle |
|
TRIUMF |
1.7.3-1 |
SL5 64-bit |
MySQL |
None |
Experiment issues
WLCG Baseline Versions
Partial Downtime Management
See the slides of the slot "Partial downtime management" on the T1SCM agenda.
Talk summary
What Pierre calls a "partial downtime" is a downtime of a single service, such as an SE, that eventually impacts any job using that service.
Pierre presented the problem using the case of the last CC-IN2P3 downtime of the dCache SE. This downtime led to many jobs being killed. LHCb consequently asked CC-IN2P3 to change its management of such downtimes; in particular, it was asked to drain the queues early enough to avoid killing jobs.
Killed jobs are a loss of CPU time for both VOs and sites. But closing the CEs is unfair to the other VOs that are not using the service in downtime, and would also cause those VOs to lose CPU time.
Without closing the CEs, solutions nevertheless exist that protect everyone from CPU loss:
- the impacted VO may ban the site
- the site may hold the impacted jobs in the queue.
Each of these actions has to be taken early enough to avoid jobs being killed. At present, there is no official way to clearly announce what actions are taken to prevent the collateral damage of partial downtimes.
Pending a better solution, Pierre's proposal (see option 2 in the slides) is, from the site side:
- to hold jobs in the queue early enough
- to announce these actions by declaring a WARNING downtime on the CEs
Discussions
LHCb reactions
- Joel: doesn't like option 2, but prefers the missing "option 0": the VO-dedicated site contact announces the downtime to the VO, and the VO takes action accordingly. During the last scheduled downtime of IN2P3-CC, it was clearly an error on LHCb's side.
- Pierre: option 2 does not mean that the usual announcements through the site contact will no longer be made. The idea of option 2 is to track what will be done, is ongoing, or has been done. If a VO shifter is wondering what is going wrong with LHCb jobs, he/she can find out by looking at the WARNING downtime. Note that operations cannot be managed only through the site contact, who may be off duty or even away for a month of vacation.
ATLAS reactions
- Alessandro: the WLCG "Operations" TEG is in fact already discussing downtime issues. This is one of the various cases under consideration.
- Simone / Ikuo (?): agree with LHCb and don't like option 2. They prefer that the site contact makes the announcement anyway. The VO knows better than the site which kind of jobs are submitted to the site, so the VO can better adapt itself to the service downtime and make the best use of the time left before it starts.
- Pierre: once again, option 2 does not remove the human announcements. The downtime declaration is just used to track the downtime and its possible collateral damage (loss of jobs).
Site reactions
- ?? of ?? site: agrees with Pierre. In fact, option 2 is what is already applied.
Conclusion
Due to lack of time, the discussion was stopped and no conclusion could be reached.
Post meeting Pierre's conclusion
While waiting for the WLCG Operations TEG to provide the community with better solutions, CC-IN2P3 needs a clear procedure for dealing with partial downtimes.
It was understood that the VOs prefer to manage their jobs themselves until the start time of a service downtime, with no draining action expected from the sites beforehand.
It was also understood that the VOs need to be informed of the service downtime early enough to be able to regulate their production accordingly.
As jobs could still be running at the downtime start time, the VOs are aware that jobs running during the downtime may be aborted by the service maintenance operation. If jobs are killed before the beginning or after the end of the downtime, the site is responsible for the job loss; otherwise each concerned VO is.
Consequently, the new proposal from CC-IN2P3 is to manage partial downtimes as follows:
- announce the service downtime to the VOs as soon as possible through each VO's communication channel when possible, i.e. when the VO contact at the site is available;
- announce the service downtime to WLCG at the daily meeting when possible, i.e. when a site representative attends the meeting;
- add downtimes to the GOC DB:
- an OUTAGE downtime on the operated services, if those services are known in the GOC DB (CVMFS is not);
- a WARNING downtime on other services partially impacted, with a clear description of the collateral damage, for instance the killing/draining of jobs of the VOs impacted by the service downtime. The impacted VOs must be clearly specified in the downtime comment.
The downtime information should give the VOs a clear picture of the site status before, during and after the downtime.
Note that with such a partial downtime procedure, the site will probably be seen as non-functional by the current grid monitoring framework. This issue should also be addressed by the WLCG TEGs.
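As a purely illustrative sketch (not part of CC-IN2P3's proposal), the GOC DB declaration step described above can be expressed as a small helper; the function and service names are assumptions introduced here:

```python
# Hypothetical sketch of the proposed GOC DB declaration logic.
# Function name, service identifiers and the registered-service set
# are illustrative assumptions, not real GOC DB API calls.

def classify_downtimes(down_service, impacted_services, registered_services):
    """Return (service, severity) pairs to declare for a partial downtime.

    down_service: the service under maintenance (e.g. a dCache SE).
    impacted_services: services only partially affected (e.g. CEs whose
        jobs may be drained or killed).
    registered_services: services known in the GOC DB; CVMFS, for
        instance, is not registered and cannot carry a downtime.
    """
    declarations = []
    # OUTAGE on the operated service itself, if the GOC DB knows it.
    if down_service in registered_services:
        declarations.append((down_service, "OUTAGE"))
    # WARNING on partially impacted services; the downtime comment
    # should name the impacted VOs and describe the collateral damage.
    for svc in impacted_services:
        if svc in registered_services:
            declarations.append((svc, "WARNING"))
    return declarations
```

For example, `classify_downtimes("dcache-se", ["ce01", "cvmfs"], {"dcache-se", "ce01"})` yields an OUTAGE on the SE and a WARNING on the CE, while the unregistered CVMFS service is skipped.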
Status of open GGUS tickets
There were 2 old tickets reported by CMS. Details are in the relevant slides linked from the agenda. Decisions will be reflected in the tickets' history; see the comments by MariaDZ in GGUS:69294 and GGUS:71864 with today's timestamp.
Review of recent / open SIRs and other open service issues
Conditions data access and related services
- Status update on COOL validation on 11g servers
- As reported at previous meetings, the first tests have shown that 11g servers out-of-the-box produce a different execution plan with non-scalable performance (query times increase as IOVs are retrieved from larger tables) for the exact same SQL queries as on 10g servers.
- While many more tests are needed, it presently seems that adequate performance and scalability on 11g servers can only be obtained by forcing the use of the 10g query optimizer. If confirmed, the solution will be deployed in a new COOL release, while the issue will be reported to Oracle Support for further investigations.
- [Discussion: the 10g server will no longer be supported after June 2012. However the use of the 10g optimizer inside the 11g server software should normally continue to be supported: if necessary, this should also be clarified with Oracle support.]
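As a hedged illustration of what "forcing the use of the 10g query optimizer" can look like, Oracle's OPTIMIZER_FEATURES_ENABLE parameter can be pinned to a 10g value at session level or via a statement hint. The exact version string and the choice of mechanism are assumptions here, not necessarily what the COOL team will deploy:

```python
# Illustrative only: build the statements that would pin an 11g server's
# optimizer behaviour to a 10g feature set. The version string
# "10.2.0.5" is an assumption.

def optimizer_downgrade_sql(version="10.2.0.5"):
    """Return a session-level ALTER statement and an equivalent hint prefix."""
    alter = f"ALTER SESSION SET optimizer_features_enable = '{version}'"
    hint = f"/*+ OPTIMIZER_FEATURES_ENABLE('{version}') */"
    return alter, hint

# On a real server these would be executed over an Oracle connection
# (e.g. with cx_Oracle) before running the COOL IOV retrieval queries,
# or the hint would be embedded in the generated SQL.
```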
- Status update on conditions access from ATLAS T0.
- The causes for the spikes of high load observed on the database servers, which result in the failures of many T0 jobs, are not yet understood and are being investigated.
- ATLAS is evaluating the use of the Frontier/Squid or CORAL server/proxy caching technologies to avoid direct Oracle connections in T0 jobs. Another meeting with the CORAL team was held today.
- As reported at previous meetings, during the first Frontier tests at T0 a discrepancy had been observed between physics results when retrieving conditions via Oracle or via Frontier. This is now understood as a data caching bug in the ATLAS muon software, which is being addressed.
Database services
- Experiment reports:
- ALICE:
- ATLAS:
- On Tuesday 1st November the ATLAS offline production database (ATLR) got stuck due to a storage issue during a broken-disk replacement. To fix the problem, the storage and the database had to be restarted manually.
- CMS:
- On Tuesday 25th October the CMS offline production database got stuck for about 20 minutes due to library cache contention. The analysis done so far did not reveal the root cause of the issue.
- All 3 production databases of CMS have been successfully patched with the latest security patches from Oracle on Thursday 3rd November.
- LHCb:
Site |
Status, recent changes, incidents, ... |
Planned interventions |
BNL |
Major network maintenance is scheduled from 7:30 AM to 3:00 PM EST on Monday November 7, affecting all services provided by the US ATLAS Tier-1 Facility at BNL. - Streams replication (propagation and apply processes) will be stopped during this intervention. - Relocation of the LFC, FTS and VOMS database RAC nodes to a new data center, Nov 7 from 10:00 AM to 1:00 PM EST. |
CNAF |
|
|
KIT |
|
|
IN2P3 |
|
|
NDGF |
|
|
PIC |
Nothing to report |
None |
RAL |
- CASTOR - HW problems (~21-22 Oct) caused the full CASTOR service to be down for a few hours. The reason for the HW failure is still under investigation. - At the beginning of the week there were also some performance problems on the ATLAS stager, which have been fixed (thanks to Nilo for the help). |
Oracle October CPU (Critical Patch Update): date not defined yet. |
SARA |
Nothing to report |
None |
TRIUMF |
Nothing to report |
None |
AOB
--
JamieShiers - 21-Oct-2011