WLCG Tier1 Service Coordination Minutes - 03 November 2011
Attendance
Action list review
Release update
Data Management & Other Tier1 Service Issues
Site |
Status |
Recent changes |
Planned changes |
CERN |
CASTOR 2.1.11-8 for all main instances; SRM 2.10-2 (2.11 on PPS); xrootd 2.1.11-1; FTS: 5 nodes in SLC5 (3.7.0-3), 7 nodes in SLC4 (3.2.1); EOS 0.1.0/xrootd 3.0.4 |
CERNT3 disk space (ATLAS and CMS) is being moved to EOS (same use case). This instance will be closed soon (users have already been moved out). As announced, the transparent upgrade of the name server nodes to the latest version has been performed (no functional improvements, just clean-up, better logging, etc.). Tests of the Tape Gateway component (optimised tape handling with improved performance and stability for tape reads/writes) are continuing on internal instances. |
More tests of the Tape Gateway are foreseen (presently only on "internal" instances). This will not go into production for the LHC experiments before 2012. |
ASGC |
CASTOR 2.1.11-5 SRM 2.11-0 DPM 1.8.0-1 |
1/11 2300 UTC - 2/11 1100 UTC: scheduled downtime for DC power construction which affected storage; extended as unscheduled till 3/11 1200 UTC to complete work |
7/11 1200 UTC - 8/11 1500 UTC: CASTOR upgrade |
BNL |
dCache 1.9.5-23 (PNFS, Postgres 9) |
None |
Nov 8-9 Upgrade to version 1.9.12-10, change from PNFS to Chimera |
CNAF |
StoRM 1.7.0 (ATLAS), StoRM 1.5.0 (other endpoints) |
|
StoRM upgrade to 1.7.0 for CMS and LHCb will be scheduled in the next 2 weeks |
FNAL |
dCache 1.9.5-23 (PNFS); httpd 1.9.5-25; Scalla xrootd 2.9.1/1.4.2-4; Oracle Lustre 1.8.3 |
|
|
IN2P3 |
dCache 1.9.5-29 (Chimera) on core servers and pool nodes |
RAM added to the Chimera server (now 32 GB) on October 25th |
|
KIT |
dCache (admin nodes): 1.9.5-27 (ATLAS, Chimera), 1.9.5-26 (CMS, Chimera) 1.9.5-26 (LHCb, PNFS) dCache (pool nodes): 1.9.5-6 through 1.9.5-27 |
|
|
NDGF |
dCache 1.9.14 (Chimera) on core servers. Mix of 1.9.13 and 2.0.0 on pool nodes. |
|
|
NL-T1 |
dCache 1.9.12-10 (Chimera) (SARA), DPM 1.7.3 (NIKHEF) |
|
|
PIC |
dCache 1.9.12-10 (last upgrade to patch release on 13-Sep); PNFS on Postgres 9.0 |
|
|
RAL |
CASTOR 2.1.10-1 2.1.10-0 (tape servers) SRM 2.10-2 |
None |
Change to the CASTOR info provider to correct the wrong total tape capacity |
TRIUMF |
dCache 1.9.5-28 with Chimera namespace |
None |
None |
Other site news
CASTOR news
CERN operations and development
EOS news
xrootd news
dCache news
StoRM news
FTS news
- FTS 2.2.5 in gLite Staged Rollout: http://glite.cern.ch/staged_rollout
- FTS 2.2.6 released in EMI-1 Update 6 on Sep 1
- restart/partial resume of failed transfers
- FTS 2.2.7 cancelled, next version will be FTS 2.2.8
- FTS 2.2.8 is being installed on the CERN pilot service. Once validated it will be opened for experiment testing.
DPM news
LFC news
- LFC 1.8.2-2 in staged rollout
LFC deployment
Site |
Version |
OS, n-bit |
Backend |
Upgrade plans |
ASGC |
1.8.0-1 |
SLC5 64-bit |
Oracle |
None |
BNL |
1.8.0-1 |
SL5, 64-bit |
Oracle |
None |
CERN |
1.8.2-0 |
SLC5 64-bit |
Oracle |
Upgrade to SLC5 64-bit only pending for lfcshared1/2 |
CNAF |
1.8.0-1 |
SL5 64-bit |
Oracle |
|
FNAL |
N/A |
|
|
Not deployed at Fermilab |
IN2P3 |
1.8.2-0 |
SL5 64-bit |
Oracle 11g |
|
KIT |
1.7.4-7 |
SL5 64-bit |
Oracle |
Oracle backend migration pending |
NDGF |
1.7.4.7-1 |
Ubuntu 10.04 64-bit |
MySQL |
None |
NL-T1 |
1.7.4-7 |
CentOS5 64-bit |
Oracle |
|
PIC |
1.7.4-7 |
SL5 64-bit |
Oracle |
|
RAL |
1.7.4-7 |
SL5 64-bit |
Oracle |
|
TRIUMF |
1.7.3-1 |
SL5 64-bit |
MySQL |
None |
Experiment issues
WLCG Baseline Versions
Partial Downtime Management
See the slides of the slot "Partial downtime management" on the T1SCM agenda.
Talk summary
What Pierre calls a "partial downtime" is a downtime of a single service, such as an SE, that eventually impacts any job using that service.
Pierre presented the problem using the case of the last CC-IN2P3 downtime of the dCache SE. This downtime led to many jobs being killed. LHCb consequently asked CC-IN2P3 to change its management of such downtimes; in particular, it was asked to drain the queues early enough to avoid killing jobs.
Killed jobs are a loss of CPU time for both VOs and sites. But closing the CEs is unfair to the other VOs that are not using the service in downtime, and would also cause those VOs to lose CPU time.
Without closing the CEs, solutions nevertheless exist that protect everyone from CPU loss:
- the impacted VO may ban the site
- the site may hold the impacted jobs in the queue.
Each of these actions has to be taken early enough to avoid jobs being killed. At present, there is no official way to clearly announce what actions are taken to prevent the collateral damage of partial downtimes.
Pending a better solution, Pierre's proposal (see option 2 in the slides) is, from the site side:
- to hold jobs in the queue early enough
- to announce these actions by declaring a WARNING downtime on the CEs
Discussions
LHCb reactions
- Joel: doesn't like option 2, but prefers the missing "option 0": the VO-dedicated site contact announces the downtime to the VO, and the VO takes action accordingly. During the last scheduled downtime of IN2P3-CC, it was clearly an error on LHCb's side.
- Pierre: option 2 does not mean that the usual announcements through the site contact will no longer be made. The idea of option 2 is to track what will be done, is ongoing, or has been done. If a VO shifter is wondering what is going wrong with LHCb jobs, he/she can find out by looking at the WARNING downtime. Note that operations cannot be managed only through the site contact, who may be off duty or even away for a month of vacation.
ATLAS reactions
- Alessandro: the WLCG "Operations" TEG is in fact already discussing downtime issues. This is one of the various cases under consideration.
- Simone / Ikuo (?): agree with LHCb and don't like option 2. They prefer that the site contact makes the announcement anyway. The VO knows better than the site which kind of jobs are submitted to the site, so the VO can better adapt itself to the service downtime and make the best use of the time left before it starts.
- Pierre: once again, option 2 does not remove the human announcements. The downtime declaration is just used to track the downtime and its possible collateral damage (loss of jobs).
Site reactions
- ?? of ?? site: agrees with Pierre. In fact, option 2 is what is already applied.
Conclusion
Due to lack of time, the discussion was stopped and no conclusion could be reached.
Post meeting Pierre's conclusion
While waiting for the WLCG Operations TEG to provide the community with better solutions, CC-IN2P3 needs a clear procedure for dealing with partial downtimes.
It was understood that the VOs prefer to manage their jobs themselves until the start time of a service downtime, with no draining action expected from the sites beforehand.
It was also understood that the VOs need to be informed of the service downtime early enough to be able to regulate their production accordingly.
As jobs could still be running at the downtime start time, the VOs are aware that jobs running during the downtime may be aborted by the service maintenance operation. If jobs are killed before the beginning or after the end of the downtime, the site is responsible for the job loss; otherwise each concerned VO is.
Consequently, the new proposal from CC-IN2P3 is to manage partial downtimes as follows:
- announce the service downtime to the VOs as soon as possible through each VO's communication channel when possible, i.e. when the VO contact at the site is available;
- announce the service downtime to WLCG at the daily meeting when possible, i.e. when a site representative attends the meeting;
- add downtimes to the GOC DB:
- an OUTAGE downtime on the operated services, if those services are known in the GOC DB (CVMFS is not);
- a WARNING downtime on other services partially impacted, with a clear description of the collateral damage, for instance the killing/draining of jobs of the VOs impacted by the service downtime. The impacted VOs must be clearly specified in the downtime comment.
The downtime information should give the VOs a clear picture of the site status before, during and after the downtime.
Note that with such a partial downtime procedure, the site will probably be seen as non-functional by the current grid monitoring framework. This issue should also be addressed by the WLCG TEGs.
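As a purely illustrative sketch (not part of CC-IN2P3's proposal), the GOC DB declaration step described above can be expressed as a small helper; the function and service names are assumptions introduced here:

```python
# Hypothetical sketch of the proposed GOC DB declaration logic.
# Function name, service identifiers and the registered-service set
# are illustrative assumptions, not real GOC DB API calls.

def classify_downtimes(down_service, impacted_services, registered_services):
    """Return (service, severity) pairs to declare for a partial downtime.

    down_service: the service under maintenance (e.g. a dCache SE).
    impacted_services: services only partially affected (e.g. CEs whose
        jobs may be drained or killed).
    registered_services: services known in the GOC DB; CVMFS, for
        instance, is not registered and cannot carry a downtime.
    """
    declarations = []
    # OUTAGE on the operated service itself, if the GOC DB knows it.
    if down_service in registered_services:
        declarations.append((down_service, "OUTAGE"))
    # WARNING on partially impacted services; the downtime comment
    # should name the impacted VOs and describe the collateral damage.
    for svc in impacted_services:
        if svc in registered_services:
            declarations.append((svc, "WARNING"))
    return declarations
```

For example, `classify_downtimes("dcache-se", ["ce01", "cvmfs"], {"dcache-se", "ce01"})` yields an OUTAGE on the SE and a WARNING on the CE, while the unregistered CVMFS service is skipped.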
Status of open GGUS tickets
There were 2 old tickets reported by CMS. Details are in the relevant slides linked from the agenda. Decisions will be reflected in the tickets' history; see the comments by MariaDZ in GGUS:69294 and GGUS:71864 with today's timestamp.
Review of recent / open SIRs and other open service issues
Conditions data access and related services
- Status update on COOL validation on 11g servers
- As reported at previous meetings, the first tests have shown that 11g servers out-of-the-box produce a different execution plan with non-scalable performance (query times increase as IOVs are retrieved from larger tables) for the exact same SQL queries as on 10g servers.
- While many more tests are needed, it presently seems that adequate performance and scalability on 11g servers can only be obtained by forcing the use of the 10g query optimizer. If confirmed, the solution will be deployed in a new COOL release, while the issue will be reported to Oracle Support for further investigations.
- [Discussion: the 10g server will no longer be supported after June 2012. However the use of the 10g optimizer inside the 11g server software should normally continue to be supported: if necessary, this should also be clarified with Oracle support.]
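As a hedged illustration of what "forcing the use of the 10g query optimizer" can look like, Oracle's OPTIMIZER_FEATURES_ENABLE parameter can be pinned to a 10g value at session level or via a statement hint. The exact version string and the choice of mechanism are assumptions here, not necessarily what the COOL team will deploy:

```python
# Illustrative only: build the statements that would pin an 11g server's
# optimizer behaviour to a 10g feature set. The version string
# "10.2.0.5" is an assumption.

def optimizer_downgrade_sql(version="10.2.0.5"):
    """Return a session-level ALTER statement and an equivalent hint prefix."""
    alter = f"ALTER SESSION SET optimizer_features_enable = '{version}'"
    hint = f"/*+ OPTIMIZER_FEATURES_ENABLE('{version}') */"
    return alter, hint

# On a real server these would be executed over an Oracle connection
# (e.g. with cx_Oracle) before running the COOL IOV retrieval queries,
# or the hint would be embedded in the generated SQL.
```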
- Status update on conditions access from ATLAS T0.
- The causes for the spikes of high load observed on the database servers, which result in the failures of many T0 jobs, are not yet understood and are being investigated.
- ATLAS is evaluating the use of the Frontier/Squid or CORAL server/proxy caching technologies to avoid direct Oracle connections in T0 jobs. Another meeting with the CORAL team was held today.
- As reported at previous meetings, during the first Frontier tests at T0 a discrepancy had been observed between physics results when retrieving conditions via Oracle or via Frontier. This is now understood as a data caching bug in the ATLAS muon software, which is being addressed.
Database services
- Experiment reports:
- ALICE:
- ATLAS:
- On Tuesday 1st November the ATLAS offline production database (ATLR) got stuck due to a storage issue during a broken-disk replacement. To fix the problem, the storage and the database had to be restarted manually.
- CMS:
- On Tuesday 25th October the CMS offline production database got stuck for about 20 minutes due to library cache contention. The analysis done so far did not reveal the root cause of the issue.
- All 3 production databases of CMS have been successfully patched with the latest security patches from Oracle on Thursday 3rd November.
- LHCb:
Site |
Status, recent changes, incidents, ... |
Planned interventions |
BNL |
Major network maintenance is scheduled from 7:30 AM to 3:00 PM EST on Monday November 7, affecting all services provided by the US ATLAS Tier-1 Facility at BNL. - Streams replication (propagation and apply processes) will be stopped during this intervention. - Relocation of the LFC, FTS and VOMS database RAC nodes to a new data center, Nov 7 from 10:00 AM to 1:00 PM EST. |
CNAF |
|
|
KIT |
|
|
IN2P3 |
|
|
NDGF |
|
|
PIC |
Nothing to report |
None |
RAL |
- CASTOR - HW problems (~21-22 Oct) caused the full CASTOR service to be down for a few hours. The reason for the HW failure is still under investigation. - At the beginning of the week there were also some performance problems on the ATLAS stager, which have been fixed (thanks to Nilo for the help). |
Oracle October CPU (Critical Patch Update): date not defined yet. |
SARA |
Nothing to report |
None |
TRIUMF |
Nothing to report |
None |
AOB
--
JamieShiers - 21-Oct-2011