WLCGServiceIncidents (2023-12-08, AlastairDewhurst)

WLCG Service Incident Reports

WLCG Service Incident Report Guidelines

Site where the incident took place
Service area to which the incident related (Infrastructure, Middleware, DB, Storage or Network)
When the problem was detected
How long it lasted
Service to which the incident related
Experiment(s) impacted by the incident and, if known, which experiment activities were affected
Report regarding the incident and problem resolution with a detailed time line
What has been done, if anything, to try and make sure the problem won't reappear

N.B. Downtimes / degradations are always "user visible" (which is what counts...)
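
The fields requested by the guidelines above can be held in a small record type. The following is only a minimal sketch: the class name `ServiceIncidentReport`, its attribute names and the `problems` helper are assumptions made for illustration and are not part of any WLCG tool. The example values are taken from the INFN-T1 entry in the Q4 2016 table.

```python
from dataclasses import dataclass, field
from typing import List

# Service areas named in the guidelines above.
SERVICE_AREAS = ("Infrastructure", "Middleware", "DB", "Storage", "Network")


@dataclass
class ServiceIncidentReport:
    """Hypothetical record holding the fields requested by the guidelines."""
    site: str
    service_area: str
    detected: str          # when the problem was detected (free text)
    duration: str          # how long it lasted (free text, e.g. "3.5 days")
    service: str
    impact: str
    report: str            # name or link of the detailed report
    experiments: List[str] = field(default_factory=list)
    prevention: str = ""   # what was done to avoid a recurrence, if anything

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; empty if none were found."""
        issues = []
        if self.service_area not in SERVICE_AREAS:
            issues.append(f"unknown service area: {self.service_area!r}")
        for name in ("site", "detected", "service", "report"):
            if not getattr(self, name).strip():
                issues.append(f"missing field: {name}")
        return issues


example = ServiceIncidentReport(
    site="INFN-T1",
    service_area="Middleware",
    detected="1 Oct 2016",
    duration="3.5 days",
    service="CREAM",
    impact="jobs had no valid proxy on the WN",
    report="post-mortem-CNAF-CE-Problem-Sept-2016.pdf",
    experiments=["LHCb"],
)
print(example.problems())
```

Several older entries on this page use service areas outside the five listed (for example "Batch" or "Tape Storage"), so a check like this would flag them rather than reject them.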

2023
- Q4 2023
- Q2 2023
- Q1 2023
2022
- Q4 2022
- Q1 2022
2021
- Q3 2021
2020
- Q2 2020
- Q1 2020
2019
- Q3 2019
2018
- Q3 2018
- Q1 2018
2017
- Q4 2017
- Q2 2017
- Q1 2017
2016
- Q4 2016
- Q3 2016
- Q2 2016
- Q1 2016
2015
- Q4 2015
- Q3 2015
- Q2 2015
- Q1 2015
2014
- Q4 2014
- Q3 2014
- Q2 2014
- Q1 2014
2013
- Q4 2013
- Q3 2013
- Q2 2013
- Q1 2013
2012
- Q4 2012
- Q3 2012
- Q2 2012
- Q1 2012
2011
- Q4 2011
- Q3 2011
- Q2 2011
- Q1 2011
2010
- Q4 2010
- Q3 2010
- Q2 2010
- Q1 2010
2009
- Q4 2009
- Q3 2009
- Q2 2009
- Q1 2009
2008
Temp Area
External References

Service area is: Infrastructure, Middleware, DB, Storage or Network
SIRs are plotted in WLCG quarterly reports both by service area and by time to resolution (Total, > 96h, > 24h)
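
The counts behind such plots can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the quarterly reports: the helper names `duration_hours` and `summarise` are hypothetical, only simple durations such as "3d", "26h" or "11 days" are parsed, entries without a usable duration (for example "-") count towards the total only, and the thresholds are treated as strict inequalities. The sample rows are copied from the tables on this page.

```python
import re
from collections import Counter

# Sample rows (site, service area, duration) copied from the tables below.
ROWS = [
    ("RAL-LCG2", "Infrastructure", "3d"),
    ("CERN-PROD", "Infrastructure", "3h"),
    ("KIT", "Middleware", "26h"),
    ("RAL-LCG2", "Network", "11 days"),
    ("CNAF", "Infrastructure", "15 days"),
    ("KIT", "Storage", "-"),
]

# Hours per unit for the few duration spellings handled by this sketch.
_UNIT_HOURS = {"h": 1, "hour": 1, "hours": 1, "d": 24, "day": 24, "days": 24}
_DURATION_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(h|hours?|d|days?)\s*$", re.IGNORECASE)


def duration_hours(text):
    """Convert a simple duration string to hours, or return None if unparseable."""
    match = _DURATION_RE.match(text)
    if match is None:
        return None
    return float(match.group(1)) * _UNIT_HOURS[match.group(2).lower()]


def summarise(rows):
    """Count incidents by service area and by time-to-resolution bucket."""
    by_area = Counter(area for _, area, _ in rows)
    hours = [duration_hours(duration) for _, _, duration in rows]
    by_resolution = {
        "Total": len(rows),
        "> 24h": sum(1 for h in hours if h is not None and h > 24),
        "> 96h": sum(1 for h in hours if h is not None and h > 96),
    }
    return by_area, by_resolution


by_area, by_resolution = summarise(ROWS)
print(dict(by_area))
print(by_resolution)
```

Many durations in the tables are free text ("several days", "Scheduled+2h", "1 day + 5 days for ATLAS replication to T1 sites"), so a real tally would need manual classification for those rows.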

2023

Q4 2023

Site	Service Area	Date	Duration	Service	Impact	Report
RAL-LCG2	Infrastructure	October 3rd - 5th	3d	CVMFS Stratum-1	Repository out of date, jobs failing across Grid	post-mortem

Q2 2023

Site	Service Area	Date	Duration	Service	Impact	Report
CERN-PROD	Infrastructure	Apr 22	3h	SSO + many dependent services	service outage	post-mortem

Q1 2023

Site	Service Area	Date	Duration	Service	Impact	Report
KIT	Middleware	Mar 29-30	26h	GGUS	service outage	ServiceIncidentReport_20230329.pdf

2022

Q4 2022

Site	Service Area	Date	Duration	Service	Impact	Report
CERN-PROD	Middleware	Oct 31	1 day	IAM-ATLAS	HTCondor CE job submission timeouts worldwide	WLCG_AuthZ_Meeting_-_ATLAS_IAM_Outage.pdf
RAL-LCG2	Network	Oct 17-28	11 days	all	outages and degradation	RAL-LCG SIR Network Outage October 2022

Q1 2022

Site	Service Area	Date	Duration	Service	Impact	Report
FZK-LCG2	Storage	Mar 18, 2022	3d	dCache and xrootd SE	Abrupt outage of all grid SEs due to network intervention	KIT_SIR_OnlineStorage_2022-03.pdf

2021

Q3 2021

Site	Service Area	Date	Duration	Service	Impact	Report
CERN-PROD	Middleware	Aug 20-21, 2021	6h	Rucio auth service	service unreachable from outside CERN, all ATLAS distributed computing activities stuck	RucioAuthSvcInc20210820
CERN-PROD	Middleware	July 9 and 25, 2021	several days	FTS-ATLAS	service dysfunctional, traffic redirected to BNL FTS	FTS service incident report

2020

Q2 2020

Site	Service Area	Date	Duration	Service	Impact	Report
CERN-PROD	Middleware	June 24-25, 2020	24h	CERN Grid CA OCSP service	CERN Grid CA certificates could not be used for job submission to CREAM CE instances at tens of sites, affecting the 4 experiments and, through the SAM tests, those sites	OTG:0057432 CERN_OCSP_incident_report.pdf
CERN-PROD	Databases	May 27, 2020	1 day + 5 days for ATLAS replication to T1 sites	many	Many site and experiment services affected	DB post-mortem 27 May 2020

Q1 2020

Site	Service Area	Date	Duration	Service	Impact	Report
CERN-PROD	Infrastructure, Storage, Middleware	Feb 20, 2020	9 hours	many	Many site services affected, all grid computing resources unavailable	CERNProdIncident200220

2019

Q3 2019

Site	Service Area	Date	Duration	Service	Impact	Report
CNAF	Infrastructure	Aug 6-21, 2019	15 days	all	all computing resources and data unavailable	SIR_CNAF_20190829.pdf

2018

Q3 2018

Site	Service Area	Date	Duration	Service	Impact	Report
CERN	Storage	Jun-Sep 2018	3 months	EOS	instabilities; some data loss	EOS report
KIT	Storage	9-10 Aug 2018	1 day	dCache	Lost (at least) 270k files for CMS.	SIR.pdf
IN2P3-CC	Storage	Aug 2018	-	XRootD	110 TB of ALICE data lost due to RAID problem	SIR.pdf

Q1 2018

Site	Service Area	Date	Duration	Service	Impact	Report
CERN	Databases	Feb 2018	5 days	LHCb	Service degraded	SIR.pdf (LHCb.pdf)

2017

Q4 2017

Site	Service Area	Date	Duration	Service	Impact	Report
KIT	Tape Storage	Dec 2017		Tape Archive	4300 files lost total	SIR.pdf

Q2 2017

Site	Service Area	Date	Duration	Service	Impact	Report
KIT	Infrastructure	31 May 2017	6h	GGUS	service unavailable	SIR_201705.pdf

Q1 2017

Site	Service Area	Date	Duration	Service	Impact	Report
CERN	Database (though the problem was rather network-related)	21 Mar 2017	48h	CMSR	Phedex downtime in CNAF and Wisconsin	20170321_SIR_CERN_PHEDEX.pdf
KIT	Storage	12 Jan 2017	-	dCache/TSM	7185 ATLAS, 75 LHCb and 2 CMS files lost	KIT_SIR_Lost_files_after_TSM_DB_storage_crash.pdf

2016

Q4 2016

Site	Service Area	Date	Duration	Service	Impact	Report
TRIUMF	Storage	18 December 2016	-	dCache	Unrecoverable data loss	TRIUMF-dcs08lun0_incident_20161218.pdf
ASGC	Storage	18 Oct 2016	-	DPM	135k ATLAS files (20 TB) lost due to RAID failure	SIRondatalossinASGCinOct.2016.pdf
INFN-T1	Middleware	1 Oct 2016	3.5 days	CREAM	jobs had no valid proxy on the WN, particularly impacting LHCb	post-mortem-CNAF-CE-Problem-Sept-2016.pdf

Q3 2016

Site	Service Area	Date	Duration	Service	Impact	Report
CERN	Middleware	15 Sep 2016	33h	LSF batch system, CREAM	jobs could not be submitted, strongly impacting ALICE and LHCb	https://twiki.cern.ch/twiki/bin/view/CMgroup/BatchServiceIncident150916

Q2 2016

Site	Service Area	Date	Duration	Service	Impact	Report
PIC	Storage	17 December 2015	-	Tape Storage	a T10KD drive writing off track made several files unreadable	SIR_PIC_ATLAS_T10KD_20160519.pdf
SARA	Infrastructure	30 June 2016	26 hours	Compute and storage	Outage	SURFsara_SIR_network_outage_30-6-2016.pdf

Q1 2016

Site	Service Area	Date	Duration	Service	Impact	Report
CERN	Infrastructure	29 Mar	2 days	VOMS	ATLAS, CMS and LHCb affected, several experiment services affected, FTS transfers affected	Report

2015

Q4 2015

Site	Service Area	Date	Duration	Service	Impact	Report
IN2P3-CC	Network	3 Nov	1h	Network	The router connecting the site to the outside world broke and all external network connections stopped working	SIR-IN2P3-CC-network-2015-11-03-v3.pdf
CERN	Batch	5 Dec	6h	Batch services	loss of running jobs, degraded capacity	IncidentBatchWorkerNodes

Q3 2015

Site	Service Area	Date	Duration	Service	Impact	Report
CERN	Infrastructure	9 Jul	2h	CVMFS	All CMS Jobs failed on WLCG	IncidentCvmfsCMS150709

Q2 2015

Site	Service Area	Date	Duration	Service	Impact	Report
FNAL	Storage	April 15	15 days	dCache	Unrecoverable data loss	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/uscmsT1_SIR_042015.pdf

Q1 2015

Site	Service Area	Date	Duration	Service	Impact	Report
SARA	Storage	January 15	25 days	dCache	Unrecoverable data loss	SURFsara_Service_Incident_Report_-_bw32-1_backplane.pdf

2014

Q4 2014

Site	Service Area	Date	Duration	Service	Impact	Report
IN2P3-CC	Network	November 26	1.6 hours	VOBoxes	Various internal services and VOBoxes were cut off the network	SIR-IN2P3-CC-network-2014-11-26-v0.pdf
CERN	Storage	October 11	5 hours	CASTOR	Outage: backend daemon of the SRM service stopped talking with the CASTOR database	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsSRMCMS20141011
CERN	Storage	October 14	4 hours	CASTOR	Outage: backend daemon of the SRM service stopped talking with the CASTOR database	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsSRMCMS20141016
KIT	Storage	September 30	-	Tape Storage	due to wrong tape markers loss of 424 files	KIT_SIR_Storage_20141023.pdf

Q3 2014

Q2 2014

Site	Service Area	Date	Duration	Service	Impact	Report
KIT	Network	Apr 1	3 weeks	Network	many job and data transfer failures for all 4 experiments, due to firewall and OPN overload by ALICE jobs	SIR-ALICE-KIT-overload-v2.pdf

Q1 2014

Site	Service Area	Date	Duration	Service	Impact	Report
RAL	Infrastructure	Mar 5	16h	GOCDB	topology and downtimes unavailable	GOCDB_Outage_5th_March_2014.doc

2013

Q4 2013

Site	Service Area	Date	Duration	Service	Impact	Report
KIT	Storage	Nov 18	-	tape archive	28 CMS files lost	A broken tape was spotted, but 28 of its files could not be found cached on disk or at other sites anymore.
CERN	Storage	Nov 5	-	EOS-CMS	78k files lost, 15 TB, 28 users affected	https://twiki.cern.ch/twiki/pub/EOS/IncidentsEOSCMSRecursiveRm20131105/20131105-EOSCMS-Service-Report.pdf (the incident reported is not considered a service incident)
ASGC	Storage	November 4	-	Disk Storage	Lost Data (approx 1M files, 140TB of data from ATLASDATADISK)	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/ASGC_DATA_LOSS_SIR-NOV_2013.pdf
KIT	Storage	October 28th	-	Disk storage, tape archival	Lost data	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/KIT_SIR_Storage_20131028.pdf
NL-T1	Storage	October 24th	2 months	grid storage cluster	Unavailability + Data loss (45 files)	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Service_Incident_Report.pdf
CERN	Middleware	October 7	4h	VOMS	Proxy creation and renewal failures, large amounts of job and data transfer failures across WLCG	https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentVOMSOct2013

Q3 2013

Site	Service Area	Date	Duration	Service	Impact	Report
CERN	Infrastructure	Sep 18	8h	VOBOXes, LFC, FTS, DB	various central services of ATLAS, CMS and LHCb impaired, transfer failures, data access errors	https://twiki.cern.ch/twiki/bin/viewauth/PESgroup/IncidentSCSSet2013
TRIUMF	Storage	September 16	-	Disk Storage	Lost Data	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Storage_incident_report_at_TRIUMF_Sep-16-2013.pdf

Q2 2013

Site	Service Area	Date	Duration	Service	Impact	Report
BNL	Storage	June 21	-	Disk Storage	Lost Data	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Service_Incident_Report_for_BNL_Tier1-06-2013.pdf

Q1 2013

Site	Service Area	Date	Duration	Service	Impact	Report
ASGC	Storage	Mar 27	-	CASTOR	lost 55 files in Atlas MCTAPE	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/ASGC-SIR20130324-Atlas_file_lost.pdf
CERN	Infrastructure	Feb 21-22	16h	VOMS	significant number of job and/or data transfer failures for all experiments throughout WLCG	VOMS incident Feb 2013
CERN	Batch Computing	Feb 10	8h	Batch	Batch system was down (unavailable for users), then dispatched jobs too slowly	LSF Master Daemon Crash and Slow Dispatch Issue
CERN	Storage	Jan 22	8h	CASTOR	CASTOR DB overload causing transfer slowness, mainly affecting CMS	CASTOR DB loads
CERN	Infrastructure	Jan 19	9h	all services relying on grid certificates, at CERN and elsewhere	many grid services unavailable to many users, large number of jobs lost	CERN CA CRL incident

2012

Q4 2012

Site	Service Area	Date	Duration	Service	Impact	Report
PIC	Storage	Dec 10	-	dCache	LHCb tape deleted (2 unique files lost)	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20121210SIRPICLHCblostfilesontape2.pdf
GridKa	File Transfer, Storage Element	Nov 27th	20 hours	FTS, dCache, LFC, CondDB	German cloud down for transfers (FTS users)	KIT_SIR_StorageFTS_20121127.pdf
RAL	all	Nov 20	50h	all	T1 services unavailable	https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20121120_UPS_Over_Voltage
RAL	all	Nov 7	27h	all	T1 services unavailable, 166 ATLAS files lost	https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20121107_Site_Wide_Power_Failure
CERN	Storage	Oct 16	4h	CASTOR	CASTORCMS severely degraded due to unstable DB execution plan	IncidentsCMSOverload20121016
PIC	Storage	Oct 9	-	dCache	Accidental ATLAS data deletion	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20121009_PIC_SIR_ATLAS_deleted_files.pdf

Q3 2012

Site	Service Area	Date	Duration	Service	Impact	Report
CNAF	Storage	Sep 21-27	6d	StoRM	LHCb data unavailable and queue closed	SIR20120921.pdf
CERN	Storage	7 Sep	n/a	EOSCMS	accidental user deletion of 1PB of data	report pending
ASGC	Storage	July 29 - Aug 07	10d	CASTOR	ATLAS and CMS transfer efficiency to Taiwan degraded. T0 export stopped	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20120729_SIR_ASGC_STAGERDB.pdf
CERN	Infrastructure	~all quarter	on-going	LSF	slow job submission critically affecting ATLAS T0. Dispatch issues affecting ATLAS T0.	https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatchSlowResponse2012 ongoing https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatchSlowDispatch2012
IN2P3	Infrastructure	3-4 Jul	21h	CVMFS	ATLAS and LHCb job failures	SIR-IN2P3-CC-CVMFS-2012-07-03-v0.pdf
IN2P3	Storage	1-2 Jul	30h	dCache	job and transfer failures, batch on hold	SIR-IN2P3-CC-dCache-2012-07-01-v1.pdf

Q2 2012

Site	Service Area	Date	Duration	Service	Impact	Report
IN2P3	Network	29 Jun	4 h	Network	All outside connectivity lost	SIR-IN2P3-CC-network-2012-06-29-v0.pdf
IN2P3	Infrastructure	24 Jun	36 h	CVMFS at IN2P3	ATLAS and LHCb jobs crashed, dCache overload by CMS jobs	SIR-IN2P3-CC-CVMFSSquid-2012-06-24-v2.pdf
PIC	WNs	21 Jun	1 h	PIC Tier1 Computing	About 17% of the WN capacity switched off due to cooling incident	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20120621_SIR_Cooling_Incident_at_PIC.pdf
CERN	Storage	18 Jun	~1h	CASTOR	c2atlas diskservers were not reachable for ~1h	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsRmNodeMisconfiguration20120618
CERN	Storage	5 Jun	1 h	CASTOR	communication problems and client timeouts	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsNameServerContention20120605
PIC	WNs	3-4 Jun	18 h	PIC Tier1 Computing	18h of service degradation: Number of cores reduced by 60% due to cooling incident	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20120603_SIR_Cooling_Incident_at_PIC.pdf
CERN	DB	22 May	1.5 h	CMS online DB	1.5 hours of high luminosity data lost	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem22May12
CERN	Storage	22 May	5-40 min	CASTOR	~1k unavailable files after transparent DB intervention	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsDegradationDBIntervention20120522
CERN	Infrastructure	19-20 April	1 day	batch	batch system down	https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatchDown190412
CERN	Infrastructure	18-20 April	2 days	batch	ATLAS Tier-0 job submission system could not keep up with incoming RAW data	https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatchSlow180412
ASGC	Storage	11-12 April	24 h	CASTOR	hardware failure, DB crashed	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/ASGC_SIR_2012-04-11.pdf
TRIUMF	All Tier-1 services	10-11 April	20 h	All Tier-1 services	Two site-wide power failures	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/TRIUMF-incident-report-april10-2012.pdf
CERN	Storage	4 April	1.5 h	CASTOR	Name Server stuck, 3 CMS files had to be rewritten	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsCentralNSStuck20120404
CERN	Storage	2 April	many days (~10)	CASTOR	1 LHCb diskserver hardware issue (files unavailable, finally 3 file systems lost)	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsDiskOnlyDataLoss20120402

Q1 2012

Site	Service Area	Date	Duration	Service	Impact	Report
PIC	Storage	15-23 March	8 days	Disk (dCache)	ATLAS file loss due to RAID corruption (Adaptec 6445): 1269 files permanently lost	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/PostMortemTier-1ServiceIncidentRAIDCORRUPTIONAdaptec644515-03-2012.doc
PIC	Storage	8-13 March	5 days	Tape (Enstore)	LTO5 tape broken, 988 files temporarily unavailable, 1 file permanently lost	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20120310SIRATLASlostfileonLTO5tapeG05918.doc
CERN (and probably others)	Infrastructure	20 Mar 2012	<=20hrs	GGUS	Some sites couldn't access GGUS web pages	https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRGGUSunreachable20120320
T0+T1s	DB	Q1	n/a	Database	Various	https://twiki.cern.ch/twiki/bin/view/DB/PhysicsDatabase11gUpgradeReport
PIC	All Tier1 services	22 Jan 2012	5 hours	All Tier1 services	Outage due to site poweroff caused by cooling incident	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20120122SIRPowerandCoolingProblematPIC.pdf

2011

Q4 2011

Site	Service Area	Date	Duration	Service	Impact	Report
CERN	Compute	17/18 Dec 2011	18 hours	CERN batch service	Batch service downtime (unavailable for users)	IncidentBatch171211
KIT	Storage	Dec 2011	3 months	tape archival	2 lost files	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/GridKa_Service_Incident_Report_12082011.pdf
KIT	Infrastructure	Nov 4-7	2.5 days	GGUS external interfaces	No ticket updates entered other ticketing systems including SNOW at the T0	https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRSNOWinterfacefailure20111104
RAL	Database (was Storage)	Oct 22-23	1.5 days	CASTOR DB	CASTOR down	https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20111022_Castor_Outage_RAC_Nodes_Crashing
CERN	DB	Oct 11		GGUS alarms	GGUS alarm to IT-DB workflow	GGUSalarmToITDBworkflowPostmortemReport11102011
CERN	DB	Oct 11-12		ATLAS Offline (ATLR)	ATLAS Offline database (ATLR) high load	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem12Oct11
KIT	Network	Oct 6	24h	GGUS	Ticketing systems at the T0 & some T1s couldn't get GGUS updates.	https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRDNSfailure20111006

Q3 2011

Site	Service Area	Date	Duration	Service	Impact	Report
CERN	DB	Sep 27	7.30h	CMS Offline	CMS offline production database stuck	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem27Sep11
BNL	DB	Sep 6	4.25h	Streams for conditions	Discrepancy in an Oracle database table	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_BNL_CONDB.pdf
IN2P3	Infrastructure	Aug 26	7.5h	CE	7500 job failures	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-PowerIncident-2011-08-26-v2.pdf
IN2P3	Infrastructure	Aug 15	19h	CEs	CEs at 100%, others at 85% degradation	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_CCIN2P3_15aug2011.pdf
CERN	DB	Aug 09	17h	CASTOR	CASTOR nameserver database overload	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsnameserverDBoverload09Aug2011
CERN	DB	July 29	Scheduled+2h	CASTOR	Upgrade-related problems with stager DB (ATLAS and CMS)	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsStagerDBUpgradeIssues04Jul2011
KIT	Infrastructure	July 22	5d	GGUS	GGUS alarm emails not working	20110727GGUS_Service_Incident_Report.pdf
IN2P3	Databases	July 19	3d	LFC, FTS, VOMS, 3D, AMI	services unavailable, some data loss	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_CCIN2P3_19july2011.pdf
KIT	Storage	July 12	15d	ATLAS dCache	11k files lost	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/sir-kit-atlas-dcache-20110728.pdf
CERN	CASTOR	July 7th	4 h	CASTOR	Garbage Collector taking too long	https://twiki.cern.ch/twiki/bin/view/CASTORService/Incidentst1transferfull07Jul2011
CERN	DB	July 05	Scheduled+7h	CASTOR	Upgrade-related problems with stager DB (ATLAS and CMS)	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsStagerDBUpgradeIssues04Jul2011
CERN	DB	July 04	Scheduled+1h	CASTOR	Upgrade-related problems with stager DB (ATLAS and CMS)	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsStagerDBUpgradeIssues04Jul2011

Q2 2011

Site	Service Area	Date	Duration	Service	Impact	Report
CERN	CASTOR	June 26	8 h	CASTOR	CMS was unable to stage files back from tape	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsCMSnostageout26Jun2011
PIC	Computing/Storage	June 10	5h	dCache PNFS	dCache namespace overload	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Post_Mortem_Tier-1_Service_Incident_dCache_PNFS_overload_10-June-2011_f.pdf
KIT	Storage	Jun 5	14 d	ALICE xrootd managed storage	3% of the files unreadable	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/GridKa_SIR_lost_files_alice_20110526.pdf
CERN	Infrastructure	May 26	6 wks	KDC	high KDC load	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/KDC-SIR.pdf
CERN	CASTOR	May 24	6 h	DB overload on the CASTOR CMS instance	Progressive degradation gradually affecting 80% of the service	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsCMSDbOverload24May2011
CERN	VObox / Lxplus / SVN,CVS / Batch	May 24	3 h	XLDAP overload and nscd problem	Logins blocked, access to software version control blocked, batch jobs failed	https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentLdapNscd24052011
PIC	Computing	May 25-26	12h	Batch System	BS instabilities, ~600 jobs lost	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Post_Mortem_PIC_Tier-1_SIR_Computing_SSC5_20110525.pdf
ASGC	Infrastructure	May 21 to May 23	36h	Whole Site	DC Power Cut	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20110521_SIR_ASGC_DCPOWERCUT.pdf
CERN	Batch / Lxplus / Vobox / Lxadm / Castor	May 10	8 h	Kerberos KDC	Logins blocked, batch jobs failed, some file access blocked	https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentKerberos10052011
RAL	DB	May 10	1h	LFC Outage After Database Update	>80%	https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20110510_LFC_Outage_After_DB_Update
ASGC	Network	May 01 to May 08	8 days	Storage service and CMS Squid	Slow transfer from/to ASGC Taiwan	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20110501_SIR_ASGC_10GbLINKDOWN.pdf
CERN	DB	Apr 28	1.5h	CMS offline DB cluster	service was down for 1.5h	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem28Apr11
IN2P3	Infrastructure	Apr 8	5h	various, incl. batch system, LFC, VOBOX	job failures, various services unavailable	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-PowerIncident-2011-04-08-v0.pdf

Q1 2011

Site	Service Area	Date	Duration	Service	Impact	Report
IN2P3	Storage	Mar 19	3.5h	SRM	dCache SRM was unusable due to internal overload	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/sir-in2p3-cc-dcachesrmincident-2011-03-19-v2.pdf
CERN	Infrastructure	Mar 19	12h	Batch system	Job submission became slow, then completely unresponsive	https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatch190302011
IN2P3	Network	Mar 14	40min	Batch system	no connection to other French sites, but no problems observed for jobs	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-Network-2011-03-14-v1.pdf
CERN	DB	11-Mar-11	5h	CMS offline production db	The database was completely down for ~2 hours and partially not available for 5 hours	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem11Mar11
IN2P3	Infrastructure	Feb 25-26	13h	Batch system	85% of batch system unavailable, jobs lost	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/sir-in2p3-cc-powerincident-2011-02-25-v0.pdf
IN2P3	Storage	Feb 13	3 h	Storage service	Storage services degraded, no big impact on jobs	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-Network-2011-02-13-v0.pdf
PIC	Storage	21-Jan-11 to 08-Feb-11	18 days	Storage service	250TB of ATLAS data partially unavailable	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20110310_SIR_PIC_ATLAS_lost_Files.pdf
KIT	Infrastructure	28-Jan-11 to 02-Feb-11	5 days	Batch system, job submission	batch system degraded, reduced # of job slots available	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/GridKa_SIR_PBS-Jan11.pdf
CERN	DB	25-Jan-11	8h	FTS, LFC, SAM, VOMS, dashboards	affected services fail, clients may hang	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem25Jan11
IN2P3	Infrastructure	8-Jul-10 to 7-Jan-11	6 months	shared s/w area	jobs fail	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-LHCb-AFS-Latency-2010-S2-v2.pdf
CNAF-BNL	Network	23-Aug-10 to 20-Jan-11	months	primary OPN circuit	poor transfer performance; ok when switched to backup	in preparation

2010

Q4 2010

Site	Date	Duration	Service	Area	Impact	Report
CERN	18 Dec	5 days	DB	DB	Service interruption: ATLARC DB following the power cut at CERN CC	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem18Dec10ATLARC
CERN	18 Dec	26 hours for services with weight > 50	power	infrastructure	Interruption of physics services following power cut	https://twiki.cern.ch/twiki/bin/view/FIOgroup/PowerCut101218
CERN	16 Dec	2.5h	DB	DB	ATLR database affected (degradation then complete outage) by FC switch replacement	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem16Dec10
CERN	7 Dec	7 days	CVS	infrastructure	CMSSW CVS migration problems	https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentCMSSW101210
CERN	Nov/Dec	8 days	DB	DB	Reboots of Instance 4 of ATLR database	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem15Dec10
KIT	26 Nov	1.5h	GGUS	infrastructure	No web access / no ticket update	https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRGGUSfailure20101126
KIT	16 Nov	3.5h	GGUS	infrastructure	No web access/ no ticket update	https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRGGUSfailure20101116
IN2P3	11 Nov	months	AFS	storage/infrastructure	shared s/w area	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/CCIN2P3-WLCGT1SCM-LHCB-SW-Problem-Report-20101111.pdf
NL-T1	26 Oct	48h	DB	DB	Inconsistency of data at SARA	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem26Oct10
CERN	20 Oct	4.5 h	Batch	infrastructure	Severely degraded response from CERN Batch Service	https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatch201010
CNAF	6 Oct	5 days	CMS storage	storage	CMS storage down (service interruption) due to GPFS bug	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/POSTmortem-CMS-Oct2010.docx
CERN	4 Oct	2.1 h	MyProxy	middleware/infrastructure	Outage on `myproxy.cern.ch` after incorrect certificate renewal	https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentMyProxy041010
IN2P3	Sep 23 - Nov 22	2 months	ATLAS file transfers	storage	Service degradation	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-Dcache-ATLAS-Transfer-Degradation-2010-Q4-v3.pdf

Q3 2010

Site	Date	Duration	Service	Impact	Report
ASGC	24 Sep	5 hours to recover almost all services, except the 3D service, which took 3 weeks	DB	DC power cut	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100924_SIR_ASGC_DCPOWERCUT.pdf
CERN	13 Sep	1.5h	CMSR DB	Spontaneous reboots of nodes 2 and 4 of CMSR	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem14Sep10
CERN	10 Sep	4 days	DB	Real time downstream was not set for LFC replication	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem15Sep10
SARA	August	>3weeks	DB	Replication for ATLAS conditions and LHCB conditions to SARA stopped	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem10Sep10
ASGC	31 Aug	4 days	DB	CASTOR outage due to STAGER DB problem	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100831_SIR_ASGC_STAGERFTS_DB.pdf
NL-T1	August	> week	DB	ATLAS NL-T1 cloud down, LHCb T1 site	http://sirs.grid.sara.nl/docs/NL-T1_SIR-20100818.pdf
CERN	23 Aug	35 h	Atlas conditions DB	ATLAS data streaming to Tier1 sites stopped	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem23Aug10
CERN	20 Aug	4h	CMS DB	Instability of node 3 and 4 of CMSR	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem20Aug10
CERN	9 Aug	16h	LHCb online	LHCBONR database unavailable	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem09Aug10
PIC	25 Jul	30h	CE	Service Degradation. Cooling problem causing about 50% of WNs to be shutdown (running jobs killed)	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100727_SIR_PIC_CoolingModule.pdf
PIC	22 Jul	10h	SE	SRM service not available for ATLAS due to a problem with dCache pool costs configuration.	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100727_SIR_PIC_DDN.pdf
PIC	20 Jul	3h	CE	Computing Service not available after SD due to a wrong gridmapdir migration.	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100727_SIR_PIC_Gridmapdir.pdf
CERN	19 Jul	2h	several	Cooling failure in the vault	https://twiki.cern.ch/twiki/bin/view/FIOgroup/513Temp100719
OSG/GOC	15 Jul	4h	GOC	GOC Service outage	https://twiki.grid.iu.edu/bin/view/Operations/GOCServiceOutageJuly162010
CERN	13 Jul	1:30-9:15	CMS DB	Few short interruptions of replication of CMS data from online to offline	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem13July10
KIT	10 Jul	4h +	site	Outage of central and local services due to a cooling failure	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_cooling_failure_20100710.pdf
NL-T1	5 Jul	1 week	SE	Reduced availability caused by data corruption	http://sirs.grid.sara.nl/docs/NL-T1_SIR-20100705.pdf
NDGF	14 Jul	16h	SE	srm.ndgf.org downtime followed by degradation	https://wiki.ndgf.org/display/ndgfwiki/20100714+dCache+server+failure
NDGF	8 Jul	3h	LFC	LFC downtime on lfc1.ndgf.org	https://wiki.ndgf.org/display/ndgfwiki/Operation-Reports-2010.07.08
KIT	5 Jul	18h	SE	CMS dCache SE down because of hardware failure	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/GridKa_SIR_20100706.pdf

Q2 2010

Site	Date	Duration	Service	Impact	Report
RAL	30 June		SE	1083 CMS files were lost.	http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100630_Disk_Server_Data_Loss_CMS
CERN	29 June	4 h	CASTOR	CASTOR outage due to AFS	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsCastorAFS29Jun2010
CERN	29 June	5 h	AFS	failure of a complete FC disk array; affected CASTOR and also LHC!	https://twiki.cern.ch/twiki/bin/view/AFSService/IncidentsArrayFailure29Jun2010
CERN	28 June	4+h	CASTOR	Log volume slowed down the Castor instances	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsSrmLogFlood28Jun2010
ASGC	29 June	~ 15 hours	3D DB	ASGC didn't apply stream LCRs from central 3D DB for 15 hours	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/3D-DB-incident-20100629.pdf
CERN	26 June	~50 min	ATLAS offline DB (ATLR)	9 Oracle services did not fail over properly after a node eviction	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem26June10
CERN	24,25 June	~10h	LHCb Streaming	Streaming of LHCb data to PIC was not working for 10 hours, streaming to other Tier1 sites was not working for 40 minutes	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem24June10
CERN	22 June		CASTOR	LDAP high load caused CASTOR to become unresponsive	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsLdapOverloaded22Jun2010
KIT/GridKa	12 June	~3:15h	CMS dCache	Service down	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/GridKa_SIR_20100612.pdf
CERN	7 June	~3h	CREAM CE	Job submission failure	https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentCREAMCe070610
CERN	2 June	1 day	ATLAS and LHCb online and offline databases	Database access and quality of DB services compromised	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem02June10
CERN	1 June	~2h	ATLAS offline and LCGR databases	Database services unavailability during scheduled maintenance for rolling upgrade/patching	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem31May10
CERN	31 May	~2h	CMS online	Database services unavailability during scheduled maintenance for rolling upgrade/patching	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem31May10
CERN	26 May	10 days	CMS offline database	Hw failure affecting one node, cluster running at reduced capacity	https://twiki.cern.ch/twiki/bin/view/DB/PostMortem26May10
PIC	21 May	19 hours	Whole site	Site power cut. Outage.	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100521_SIR_PIC_PowerCut.pdf
CERN	14 May	-	CASTOR	Data loss from incorrect recycling	https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsAliceRecycled14May2010
GGUS	12 May	<=4.5 hours	.de domain	Domain does not exist	https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRdeDNSfailure20100512
CERN-ASGC	12-15 May	-	LHCOPN	Reduced bandwidth	SIRCernAsgcLinkMay2010
CNAF	28 and 29 April	9 hours & 12 hours	STORM	SRM blockage (hardware) followed by MCDISK full and STORM bug	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-CNAF--AtlasSRMoutage-April-2010.pdf
IN2P3	26 Apr	17.5 hours	AFS	Distributed File System (AFS) crashed after server overload. Batch also affected.	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-AFSoutage-2010-04-26.pdf
IN2P3	24 Apr	17 hours	Batch	services location service stopped responding to requests blocking most batch system commands	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-BatchOutage-2010-04-24.pdf
IN2P3	20 Apr	9 hours & 5 days	Grid Downtime Notification	Grid downtime notifications were impossible after two consecutive incidents	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-OperationsPortal-2010-04-22v2.pdf

Q1 2010

Site	Date	Duration	Service	Impact	Report
CERN	3 Mar	18 hours	DB Replication	Replication of LHCb conditions Tier0->Tier1, Tier0->online partially down	https://twiki.cern.ch/twiki/bin/view/PDBService/StreamsPostMortem#Replication_of_LHCb_conditions_T
IN2P3	15 Feb	4.25 hours	Batch	Local worker nodes lost network connectivity	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-WNs-disconnected-2010-02-15-2.pdf
PIC	10 Feb	7 hours	Spanish-CA CRLs expired at CERN	Complete blackout of services involving grid certificates either personal or host from Spanish CA at CERN: VOMS, FTS, etc.	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_Rediris_wLCG_formatted.pdf https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentCAProxy1202210
CERN	7 Feb	4 hours	Batch Tier-0 Atlas RTT cluster	Degraded service on RunTimeTester cluster due to misconfiguration	http://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatch0702210
CERN	30 Jan	2 days	CASTORATLAS	A bug caused the xroot daemon to loop on the castoratlas name server, slowing down all normal name server calls and causing the migrator policy to fail	https://twiki.cern.ch/twiki/bin/viewauth/CASTORService/IncidentsMigrationBacklog01Feb2010
RAL	29 Jan	5 days	CASTOR - all instances	A scheduled outage to migrate the Castor Databases back to their original disk arrays encountered significant problems resulting in an extended outage.	http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100129
ASGC	18 Jan	2 days	Power system	One-second power surge; most services were restarted	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/ASGC_incident_report_Jan18_2010.pdf
GridKa/KIT	13 Jan	26 hours	site BDII and lcg-CE	site BDII query problems and missing lcg-CE information	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_FZK-LCG2_2010-01-13.pdf
IN2P3	4 Jan	6 hours	Batch	Local batch system database server overload	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-lbms-DB-overload-2010-01-04.pdf
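
The page notes that SIRs are plotted by time to resolution (Total, > 96h, > 24h). As a rough illustration only, the Duration cells of the Q1 2010 table above could be bucketed as in the Python sketch below. The helper names `duration_hours` and `resolution_buckets` are hypothetical, not part of any WLCG tooling, and the parser is deliberately conservative: it accepts only cells of the form "number + hours/days" and leaves everything else (e.g. "n/a", "9 hours & 12 hours") unparsed rather than guessing.

```python
import re
from collections import Counter

# Assumption: a Duration cell is usable only if it is a single number
# followed by an hour or day unit, optionally prefixed with "~".
_DURATION = re.compile(r"\s*~?\s*(\d+(?:\.\d+)?)\s*(hours?|days?|h|d)\s*", re.IGNORECASE)


def duration_hours(cell):
    """Return the duration in hours, or None if the cell is not clearly parseable."""
    match = _DURATION.fullmatch(cell)
    if match is None:
        return None
    value, unit = match.groups()
    factor = 24.0 if unit.lower().startswith("d") else 1.0
    return float(value) * factor


def resolution_buckets(cells):
    """Count incidents in total, above 24h, above 96h, and with unparsed durations."""
    counts = Counter()
    for cell in cells:
        counts["total"] += 1
        hours = duration_hours(cell)
        if hours is None:
            counts["unparsed"] += 1
            continue
        if hours > 24:
            counts[">24h"] += 1
        if hours > 96:
            counts[">96h"] += 1
    return counts


# Duration column of the Q1 2010 table, in row order.
q1_2010 = ["18 hours", "4.25 hours", "7 hours", "4 hours", "2 days",
           "5 days", "2 days", "26 hours", "6 hours"]
counts = resolution_buckets(q1_2010)
print(dict(counts))
```

Cells reported as unparsed would need to be classified by hand from the linked report; the sketch does not attempt to interpret ranges or compound durations.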

2009

Q4 2009

Site	Date	Duration	Service	Impact	Report
PIC	19 Dec	4.5 hours	Cooling	Most of Tier-1 services shutdown to avoid increasing temperature due to cooling failure	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20091219_PIC_Service_Incident_Report.pdf
IN2P3	8 Dec	1.5 hours	Networking	Grid services unavailability caused by load balancing mechanism failure	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/sir_in2p3network_outage_10_12_2009.pdf
CERN	2 Dec	2 hours +	Site wide power cut	Most CC services down	https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem02Dec09
RAL	30 Nov	n/a	Storage	LHCb Data Loss Incident at RAL	http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20091130
CERN	20 Nov	1h	SRM/ATLAS	SRM high failure rate and restart after thread exhaustion	https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20Nov09
CERN	18 Nov	10h	CMS Dashboard	Performance degradation	http://dashboard.cern.ch/reports/CMSmigrationProblem
IN2P3	12 Nov	n/a	Storage	CMS Data Loss Incident at FR-CCIN2P3	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/2009-11-26_CMS_CCIN2P3_Report.pdf
IN2P3	3 Nov	4h	Many	Many services have been disturbed due to automatic reboot of machines	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_CCIN2P3_cooling_outage_03nov2009.doc
IN2P3	14 Oct 2009	13h	batch	only very short jobs able to run	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/sir_BatchIncident_15_10_09.pdf
CERN	13 Oct 2009	1-2h	CASTOR nameserver sick	All CASTOR services dead	https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20091013
RAL	9 Oct	n/a	Storage (Castor)	data loss from Castor	http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20091009
IN2P3	8 & 10 Oct 2009	11h (8 Oct) and 6h (10 Oct)	SRM crashed	SRM service interrupted	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_CCIN2P3_SRM_incident_08oct2009.doc
RAL	4-9 Oct 2009		disk failures -> Oracle problems	CASTOR, LFC and FTS services down	http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20091004
ASGC	continuation - xx Nov	MONTHS!!!	DB & DM services	See presentation at DB workshop	http://indico.cern.ch/getFile.py/access?contribId=30&sessionId=4&resId=1&materialId=slides&confId=70892
ASGC	27 Sep - xx Oct	>3 weeks	DBs	down & out	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/ASGC-DB-Sep28.pdf

Q3 2009

Site	Date	Duration	Service	Impact	Report
CERN	21 Sep 2009	08:00 - 18:00	DB Replication	ATLAS Replication Tier0->Tier1 down	https://twiki.cern.ch/twiki/bin/view/PDBService/StreamsPostMortem
RAL	15 - 17 Sept 2009	2 days	CASTOR	Disk to Disk (D2D) transfers started failing during a planned upgrade to the NS	http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20090915
FZK	7 - 16 Sep 2009	10 days	ATLAS RAC	3D Streams replication blocked then degraded	https://twiki.cern.ch/twiki/bin/viewfile/LCG/WLCGServiceIncidents?rev=1;filename=SIR-FZK-20090907.pdf
CERN	5 & 8 Sept 2009	2 * 2 hours	CASTOR LHCb	two Castor Database problems	https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090905
CERN	26 Aug 2009	18:40 - 23:30	Batch	Public and production queues closed	https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090826
ASGC	17 Jul 2009	6:00 - 10:00	Power cut	Most services went down and restarted	https://twiki.cern.ch/twiki/bin/viewfile/LCG/WLCGServiceIncidents?rev=1;filename=power_cut_ASGC.txt
ATLAS	13 Jul 2009	10:00 - 11:00	Central Catalogs	Degraded performance	PostMortem13Jul09

Q2 2009

Site	Date	Duration	Service	Impact	Report
NL-T1	STEP09				https://twiki.cern.ch/twiki/pub/Atlas/Step09Feedback/Post_Mortem_STEP09_NL-T1-0.4.pdf
OPN	10 Jun 09	>1 day	LHC OPN	primary circuits to ASGC, CNAF, KIT, NDGF, TRIUMF (incl. backup)	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Fibre_Cut_June_2009.pdf
FZK	STEP09	many days	storage		https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_storage_FZK_GridKa.pdf
ATLAS	27 Jun 09	2 days(?)	PVSS2COOL	online reconstruction was stopped	PostMortem27Jun09
ATLAS	24 Jun 09	8 hours	PanDA and ATLR	Degraded PanDA service, impact on other offline DB services on ATLR	https://twiki.cern.ch/twiki/bin/view/Atlas/PandaAtlrJune2009
CERN	11 Jun 09	n/a	LHCb conditions access, LFC	scalability problem	https://twiki.cern.ch/twiki/bin/view/PSSGroup/LFCReplicaSvcPostMortem
CERN	18 Jun 09	2 hours	Batch & CASTOR services	down	https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090618
IN2P3	10 Jun 09	7 hours	GridFTP	Transfers	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_GRID-FTP_OUTAGE_2009_06_11-1.pdf
CERN	4 Jun 09	n/a	CASTOR LHCb	accidental garbage collection of tape0disk1 files	https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090604
CERN	3 Jun 09	n/a	CASTOR LHCb	accidental re-enabling of garbage collection in lhcbdata	https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090603
CERN	1 Jun 09	~4 hours	DB services	unavailable	https://twiki.cern.ch/twiki/bin/view/PSSGroup/StreamsPostMortem#Network_hardware_problem_affecti
PIC	23 - 26 May 09	3 days	LFC	instability	https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Post_mortem_LFC_indicent_23-26_May_2009_-_WikiPIC.pdf
PIC	14 May 09	5 hours	cooling	down	SIR_PIC_COOLING_OUTAGE_2009_05_14.pdf
SARA	04 May 09	36 hours	MSS	down	SIR_SARA_TAPEBACKEND_OUTAGE_2009_05_04.pdf
IN2P3	3 May 09	44 hours	cooling	down	SIR_COOLING_OUTAGE_2009_05_03.pdf
IN2P3	25 Apr 09	7.5 hours	MSS	down	SIR_ROBOTIC_LIBRARY_OUTAGE_2009_04_26-3.pdf
IN2P3	20 Apr 09	12 hours	MSS	down	SIR_ROBOTIC_LIBRARY_OUTAGE_2009_04_22.pdf
CERN	12 Apr 09	VOMS: 2 days, SRM: 1 hour	VOMS, SRM	Degraded	VomsPostMortem2009x04x10
PIC	10 Apr 09	8 hours	SRM	ATLAS, CMS and LHCb	20090411_SIR_SRM_PIC.pdf
IN2P3	02 Apr 09	24 hours	tape service	down	https://twiki.cern.ch/twiki/pub/LCG/WLCGDailyMeetingsWeek090330/IN2P3_02april2009_WLCG_incident_report.doc

Q1 2009

Site	Date	Duration	Service	Impact	Report
RAL	29 Mar 09	33 hours	complete site	down	http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20090324
RAL	09 Mar 09	24 hours	DNS	all, especially SRM	http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20090309
CERN	04 Mar 09	3 hours	CASTOR	down	https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090304
ASGC	25 Feb 09	days to weeks	many	down	Fire in UPS. Partial report on Tuesday in https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek090406

2008

Site	Date	Duration	Service	Impact	Report
NL-T1	21 Oct 08	12 hours	most	down
ASGC	25 Oct 08	many days	CASTOR	down	http://indico.cern.ch/conferenceDisplay.py?confId=44840
SARA	28 Oct 08	7 hours	SE/SRM/tape b/e	down	https://twiki.cern.ch/twiki/pub/LCG/WLCGDailyMeetingsWeek081103/post_mortem_tape-system_outage_25_10_in_NL.pdf
PIC	31 Oct 08	10 hours	SRM	down	PICServiceIncidentReport20090416
NDGF	18-20 Oct 08	2 days	streams	-	https://twiki.cern.ch/twiki/bin/view/PSSGroup/StreamsPostMortem#Problem_with_ATLAS_replication_f
CERN	24 Oct 08	3-4 hours	FTS	channels down or degraded	https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortemFts24Oct08
CERN	24 Oct 08	2 hours	VOMS	short interrupt then degraded	https://twiki.cern.ch/twiki/bin/view/LCG/VomsPostMortem2008x10x24
RAL	18 Oct 08	55 hours	CASTOR	downtime	http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20081018
CERN	24/10/2008	3-4 hours	FTS	channels: CERN-ASGC, CERN-IN2P3, CERN-RAL, NIKHEF-CERN, PIC-CERN, SARA-CERN	https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortemFts24Oct08
RAL	18/10/2008	55 hours	CASTOR	ATLAS	http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20081018
NIKHEF	04/10/2008	~36 hours	site	site	https://twiki.cern.ch/twiki/pub/LCG/WLCGDailyMeetingsWeek081020/post_mortem_NL-T1_power_outage-Oct17.txt
RAL	17/09/2008	17h (LHCb) 12h (ATLAS)	CASTOR	14K files	http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20080917
CNAF	07/09/2008	12+ hours	CASTOR	complete loss of service	https://twiki.cern.ch/twiki/bin/viewfile/LCG/WLCGDailyMeetingsWeek080915?rev=1;filename=post-mortem_of_September_7_CNAF_CASTOR_problem.pdf

Temp Area

TempArea

External References

RAL Service Incidents

LHC Incident
20090411_SIR_SRM_PIC.pdf: 20090411_SIR_SRM_PIC.pdf

SIR_SARA_TAPEBACKEND_OUTAGE_2009_05_04.pdf: SIR for SARA Tapebackend outage 4 to 6 May 2009

SIR_COOLING_OUTAGE_2009_05_03.pdf: SIR for IN2P3 cooling failure of 3 May 2009

SIR_PIC_COOLING_OUTAGE_2009_05_14.pdf: SIR for PIC Cooling Outage of 14 May 2009

power_cut_ASGC.txt: power cut at ASGC on July 17th

SIR-FZK-20090907.pdf: SIR of FZK degraded ATLAS RAC 7 to 16 Sep 2009

SIR_CCIN2P3_cooling_outage_03nov2009.doc: IN2P3 cooling outage Nov 3rd

2009-11-26_CMS_CCIN2P3_Report.pdf: CMS Data Loss Incident at FR-CCIN2P3

sir_in2p3network_outage_10_12_2009.pdf: SIR of IN2P3 DNS Load Balancing Failure 8 December 2009

SIR-IN2P3-CC-lbms-DB-overload-2010-01-04.pdf: IN2P3 Local batch system database server overload

SIR_FZK-LCG2_2010-01-13.pdf: SIR FZK-LCG2 (GridKa/KIT) - Information system problems on 13th and 14th of January 2010

power_surge_ASGC_20090118.txt: Power surge at ASGC on Jan 18 2009

SIR_Rediris_wLCG_formatted.pdf: SIR_Rediris_wLCG_formatted.pdf

SIR-IN2P3-CC-WNs-disconnected-2010-02-15-2.pdf: Worker node network connectivity loss at IN2P3 15 Feb 2010

SIR-IN2P3-CC-BatchOutage-2010-04-24.pdf: SIR of IN2P3 batch outage of 24/25 April 2010

SIR-IN2P3-CC-OperationsPortal-2010-04-22v2.pdf: SIR for IN2P3 Downtimes Notification Impossible

SIR-IN2P3-CC-AFSoutage-2010-04-26.pdf: SIR for IN2P3 AFS Outage

SIR-CNAF--AtlasSRMoutage-April-2010.pdf: CNAF ATLAS SRM blockage 28 April then MCDISK full STORM bug

20100521_SIR_PIC_PowerCut.pdf: SIR for the power cut affecting PIC Tier1 on 21-22 May 2010

GridKa_SIR_20100706.pdf: GridKa_SIR_20100706.pdf

20100727_SIR_PIC_DDN.pdf: SRM ATLAS problems at PIC on 22-Jul due to wrong dCache configuration. About 10h.

20100727_SIR_PIC_CoolingModule.pdf: Cooling problem at PIC WN module causing about 50% of WNs to be shutdown (running jobs killed)

20110211_SIR_PIC_ATLAS_lost_files.pdf: Incident with ATLAS lost files at PIC 21/1/2011

20110310_SIR_PIC_ATLAS_lost_Files.pdf: Update to the PIC SIR of lost files with ATLAS (21-Jan-2011 to 8-Feb-2011)

Post_Mortem_PIC_Tier-1_SIR_Computing_SSC5_20110525.pdf: SIR for the computing incident at PIC on 25/26th May 2011

GridKa_SIR_lost_files_alice_20110526.pdf: GridKa_SIR_lost_files_alice_20110526.pdf

20120122SIRPowerandCoolingProblematPIC.pdf: SIR of the power and cooling incident at PIC Jan 22nd 2012

TRIUMF-incident-report-april10-2012.pdf: TRIUMF incident report

20120603_SIR_Cooling_Incident_at_PIC.pdf: Cooling incident at PIC on 3-Jun-2012: Computing service degraded

20120621_SIR_Cooling_Incident_at_PIC.pdf: Cooling incident at PIC 21-Jun-2012: 17% of WNs switched off

SIR-IN2P3-CC-CVMFSSquid-2012-06-24-v1.pdf: software area unavailable at IN2P3 on 24-Jun-2012

20121009_PIC_SIR_ATLAS_deleted_files.pdf: SIR for accidental ATLAS files deletion at PIC

20121210SIRPICLHCblostfilesontape1.pdf: SIR for the lost LHCb tape files at PIC on Dec 2012

KIT_SIR_StorageFTS_20121127.pdf: SIR about offline FTS and dCache pool nodes end of Nov 2012 at GridKa.

Service_Incident_Report_for_BNL_Tier1-06-2013.pdf: Service Incident Report for US ATLAS Tier-1 Center

KIT_SIR_Storage_20131028.pdf: 130 files lost for CMS

KIT_SIR_Storage_20141023.pdf: KIT: SIR: identification of file losses from tape due to wrong end of tape markers

SIR-IN2P3-CC-network-2014-11-26-v0.pdf: SIR-IN2P3-CC-network-2014-11-26-v0.pdf

SURFsara_Service_Incident_Report_-_bw32-1_backplane.pdf: SURFsara_Service_Incident_Report_-_bw32-1_backplane.pdf

uscmsT1_SIR_042015.pdf: 2015-05 FNAL uscms lost files

SIR_Report_T10KD.pdf: T10KD issue at PIC

SIR_PIC_Report_T10KD.pdf: T10KD issue at PIC

SIR_PIC_ATLAS_T10KD_20160519.pdf: T10KD issue at PIC affecting ATLAS

LHCb_Databases_Upgrade_Migration_Incident_report.pdf: LHCb_Databases_Upgrade_Migration_Incident_report.pdf

KIT_SIR_OnlineStorage_2022-03.pdf: SE outage due to network intervention

CVMFSSIR20231003.pdf: CVMFSSIR20231003.pdf

Attachments

Topic attachments
I	Attachment	History	Action	Size	Date	Who	Comment
pdf	2009-11-26_CMS_CCIN2P3_Report.pdf	r1	manage	78.4 K	2009-12-01 - 15:52	DirkDuellmann	CMS Data Loss Incident at FR-CCIN2P3
pdf	20090411_SIR_SRM_PIC.pdf	r1	manage	152.9 K	2009-04-16 - 15:10	OlofBarring
pdf	20091219_PIC_Service_Incident_Report.pdf	r1	manage	23.3 K	2009-12-23 - 11:20	GonzaloMerino	SIR of the cooling incident at PIC on 19 Dec 2009
pdf	20100521_SIR_PIC_PowerCut.pdf	r1	manage	123.4 K	2010-05-31 - 09:39	GonzaloMerino	SIR for the power cut affecting PIC Tier1 on 21-22 May 2010
pdf	20100727_SIR_PIC_CoolingModule.pdf	r1	manage	75.0 K	2010-07-27 - 18:14	GonzaloMerino	Cooling problem at PIC WN module causing about 50% of WNs to be shutdown (running jobs killed)
pdf	20100727_SIR_PIC_DDN.pdf	r1	manage	63.4 K	2010-07-27 - 17:25	GonzaloMerino	SRM ATLAS problems at PIC on 22-Jul due to wrong dCache configuration. About 10h.
pdf	20100727_SIR_PIC_Gridmapdir.pdf	r1	manage	48.0 K	2010-07-27 - 17:23	GonzaloMerino	CE failure at PIC of 3hrs on 20-Jul due to a faulty gridmapdir migration.
pdf	20100831_SIR_ASGC_STAGERFTS_DB.pdf	r2 r1	manage	246.7 K	2010-09-13 - 18:26	JhenWeiHuang	20100831_SIR_ASGC_STAGERFTS_DB.pdf
pdf	20100924_SIR_ASGC_DCPOWERCUT.pdf	r1	manage	249.2 K	2010-10-16 - 22:40	JhenWeiHuang	20100924_SIR_ASGC_DCPOWERCUT
pdf	20110211_SIR_PIC_ATLAS_lost_files.pdf	r1	manage	45.7 K	2011-02-11 - 13:29	GonzaloMerino	Incident with ATLAS lost files at PIC 21/1/2011
pdf	20110310_SIR_PIC_ATLAS_lost_Files.pdf	r1	manage	74.2 K	2011-03-10 - 13:44	GonzaloMerino	Update to the PIC SIR of lost files with ATLAS (21-Jan-2011 to 8-Feb-2011)
pdf	20110501_SIR_ASGC_10GbLINKDOWN.pdf	r1	manage	187.2 K	2011-05-11 - 19:03	JhenWeiHuang	20110501_SIR_ASGC_10GbLINKDOWN.pdf
pdf	20110521_SIR_ASGC_DCPOWERCUT.pdf	r1	manage	114.7 K	2011-05-27 - 07:56	JhenWeiHuang	SIR for ASGC DC Power Cut on 21 May 2011
pdf	20110727GGUS_Service_Incident_Report.pdf	r1	manage	51.0 K	2011-07-27 - 12:00	DirkDuellmann
pdf	20120122SIRPowerandCoolingProblematPIC.pdf	r1	manage	313.2 K	2012-02-03 - 14:50	GonzaloMerino	SIR of the power and cooling incident at PIC Jan 22nd 2012
doc	20120310SIRATLASlostfileonLTO5tapeG05918.doc	r1	manage	34.0 K	2012-04-13 - 21:16	AlexeySedov	ATLAS Tape incident at PIC
pdf	20120603_SIR_Cooling_Incident_at_PIC.pdf	r1	manage	54.4 K	2012-06-04 - 23:58	GonzaloMerino	Cooling incident at PIC on 3-Jun-2012: Computing service degraded
pdf	20120621_SIR_Cooling_Incident_at_PIC.pdf	r1	manage	86.0 K	2012-06-28 - 11:57	GonzaloMerino	Cooling incident at PIC 21-Jun-2012: 17% of WNs switched off
pdf	20120729_SIR_ASGC_STAGERDB.pdf	r1	manage	293.2 K	2012-11-21 - 18:55	JhenWeiHuang	20120729_SIR_ASGC_STAGERDB
pdf	20121009_PIC_SIR_ATLAS_deleted_files.pdf	r1	manage	61.3 K	2012-10-26 - 13:40	GonzaloMerino	SIR for accidental ATLAS files deletion at PIC
pdf	20121210SIRPICLHCblostfilesontape2.pdf	r1	manage	58.7 K	2012-12-14 - 15:54	GonzaloMerino	SIR for the lost LHCb tape files at PIC on Dec 2012
pdf	20170321_SIR_CERN_PHEDEX.pdf	r1	manage	49.0 K	2017-04-03 - 13:24	KateDziedziniewicz	CMS Phedex not working at CNAF/WISCONSIN after CMSR migration
pdf	3D-DB-incident-20100629.pdf	r1	manage	43.4 K	2010-06-30 - 20:22	FelixLee	ASGC 3D DB incident report 20100629
pdf	ASGC-DB-Sep28.pdf	r1	manage	22.5 K	2009-10-12 - 17:05	JamieShiers
pdf	ASGC-SIR20130324-Atlas_file_lost.pdf	r2 r1	manage	31.4 K	2013-05-08 - 00:40	FelixLee	ASGC file loss to Atlas MCTAPE
pdf	ASGC_DATA_LOSS_SIR-NOV_2013.pdf	r1	manage	36.5 K	2013-11-21 - 14:41	FelixLee	ASGC_DATA_LOSS_SIR-NOV2013
pdf	ASGC_SIR_2012-04-11.pdf	r1	manage	281.0 K	2012-05-03 - 10:35	JhenWeiHuang	ASGC_SIR_2012-04-11.pdf
pdf	ASGC_incident_report_Jan18_2010.pdf	r2 r1	manage	16.6 K	2010-02-02 - 02:56	HorngLiangShih
pdf	CCIN2P3-WLCGT1SCM-LHCB-SW-Problem-Report-20101111.pdf	r1	manage	78.3 K	2011-01-18 - 10:40	JamieShiers	CCIN2P3 Shared s/w area interim report
pdf	CERN_OCSP_incident_report.pdf	r1	manage	46.9 K	2020-06-29 - 21:40	MaartenLitmaath	CERN Grid CA OCSP incident report, June 24-25, 2020
pdf	CVMFSSIR20231003.pdf	r1	manage	59.0 K	2023-12-08 - 15:42	AlastairDewhurst
pdf	Fibre_Cut_June_2009.pdf	r1	manage	177.9 K	2009-07-06 - 08:30	JamieShiers
doc	GOCDB_Outage_5th_March_2014.doc	r1	manage	30.5 K	2014-03-13 - 15:52	MaartenLitmaath	GOCDB Outage 5th March 2014
pdf	GridKa_SIR_20100612.pdf	r1	manage	28.5 K	2010-06-15 - 15:19	UnknownUser	CMS dCache down for approx. 3h15
pdf	GridKa_SIR_20100706.pdf	r1	manage	34.2 K	2010-07-07 - 23:44	JosVanWezel
pdf	GridKa_SIR_PBS-Jan11.pdf	r1	manage	47.7 K	2011-02-07 - 14:57	AndreasHeiss	SIR about GridKa local batch system problems, January 2011
pdf	GridKa_SIR_lost_files_alice_20110526.pdf	r1	manage	8.2 K	2011-06-06 - 17:23	JosVanWezel	KIT SIR lost files ALICE 5/2011
pdf	GridKa_Service_Incident_Report_12082011.pdf	r1	manage	461.9 K	2011-12-12 - 15:00	XavierMol
pdf	KDC-SIR.pdf	r2 r1	manage	66.1 K	2011-08-23 - 14:52	DirkDuellmann
pdf	KIT_SIR_CMSChimeraDatabase_2018-08.pdf	r1	manage	196.1 K	2018-08-20 - 10:15	XavierMol	Database incident CMS dCache Aug 2018
pdf	KIT_SIR_OnlineStorage_2022-03.pdf	r1	manage	167.3 K	2022-08-12 - 10:33	XavierMol	SE outage due to network intervention
pdf	KIT_SIR_StorageFTS_20121127.pdf	r1	manage	298.0 K	2013-01-22 - 16:01	XavierMol	SIR about offline FTS and dCache pool nodes end of Nov 2012 at GridKa.
pdf	KIT_SIR_Storage_20131028.pdf	r2 r1	manage	429.2 K	2014-04-08 - 08:29	XavierMol	130 files lost for CMS
pdf	KIT_SIR_Storage_20141023.pdf	r1	manage	203.8 K	2014-10-31 - 13:06	ThomasHartmann	KIT: SIR: identification of file losses from tape due to wrong end of tape markers
pdf	KIT_SIR_TapeStorage_2017-12.pdf	r1	manage	195.6 K	2018-03-13 - 08:57	XavierMol	SIR KIT Tape Storage Q4 2017
pdf	LHCb_Databases_Upgrade_Migration_Incident_report.pdf	r1	manage	43.1 K	2018-03-21 - 18:27	IgnacioCoterillo
docx	POSTmortem-CMS-Oct2010.docx	r1	manage	117.8 K	2010-10-15 - 13:51	MaartenLitmaath	CMS storage down at CNAF Oct 6-10, 2010
doc	PostMortemTier-1ServiceIncidentRAIDCORRUPTIONAdaptec644515-03-2012.doc	r1	manage	52.5 K	2012-04-13 - 21:13	AlexeySedov	ATLAS Data Loss Incident at PIC
pdf	Post_Mortem_PIC_Tier-1_SIR_Computing_SSC5_20110525.pdf	r1	manage	94.9 K	2011-06-01 - 16:18	UnknownUser	SIR for the computing incident at PIC on 25/26th May 2011
pdf	Post_Mortem_Tier-1_Service_Incident_dCache_PNFS_overload_10-June-2011.pdf	r1	manage	129.5 K	2011-06-14 - 17:14	UnknownUser
pdf	Post_Mortem_Tier-1_Service_Incident_dCache_PNFS_overload_10-June-2011_f.pdf	r1	manage	129.9 K	2011-06-16 - 15:53	UnknownUser
pdf	Post_mortem_LFC_indicent_23-26_May_2009_-_WikiPIC.pdf	r1	manage	163.7 K	2009-05-27 - 17:28	JamieShiers
pdf	RAL-LCG2NetworkOutageSIROctober2022.pdf	r1	manage	102.0 K	2023-04-04 - 17:01	AlastairDewhurst	RAL-LCG SIR Network Outage October 2022
pdf	SIR-2018-CCIN2P3-DiskServerFailure.pdf	r1	manage	416.0 K	2018-10-05 - 16:26	EricFede	SIR for CCIN2P3 Data lost on xrootd storage
pdf	SIR-ALICE-KIT-overload-v2.pdf	r1	manage	78.8 K	2014-05-07 - 18:52	MaartenLitmaath	SIR about KIT firewall and OPN overload by ALICE jobs
pdf	SIR-CNAF--AtlasSRMoutage-April-2010.pdf	r1	manage	112.5 K	2010-05-10 - 14:22	HarryRenshall	CNAF ATLAS SRM blockage 28 April then MCDISK full STORM bug
pdf	SIR-FZK-20090907.pdf	r1	manage	74.9 K	2009-09-29 - 14:42	HarryRenshall	SIR of FZK degraded ATLAS RAC 7 to 16 Sep 2009
pdf	SIR-IN2P3-CC-AFSoutage-2010-04-26.pdf	r1	manage	12.0 K	2010-05-07 - 11:14	HarryRenshall	SIR for IN2P3 AFS Outage
pdf	SIR-IN2P3-CC-BatchOutage-2010-04-24.pdf	r1	manage	15.4 K	2010-05-04 - 09:48	HarryRenshall	SIR of IN2P3 batch outage of 24/25 April 2010
pdf	SIR-IN2P3-CC-CVMFS-2012-07-03-v0.pdf	r1	manage	6.9 K	2012-07-18 - 23:06	MaartenLitmaath	IN2P3-CC CVMFS inconsistency
pdf	SIR-IN2P3-CC-CVMFSSquid-2012-06-24-v2.pdf	r1	manage	8.7 K	2012-08-29 - 22:17	MaartenLitmaath	software area unavailable at IN2P3 on 24-Jun-2012
pdf	SIR-IN2P3-CC-LHCb-AFS-Latency-2010-S2-v2.pdf	r1	manage	212.3 K	2011-02-14 - 22:14	MaartenLitmaath	Slow AFS response causing environment setup timeout for LHCb jobs
pdf	SIR-IN2P3-CC-Network-2011-02-13-v0.pdf	r1	manage	6.8 K	2011-03-01 - 15:45	MaartenLitmaath	IN2P3-CC core network switch outage due to CPU card failure
pdf	SIR-IN2P3-CC-Network-2011-03-14-v1.pdf	r1	manage	6.2 K	2011-03-25 - 16:07	MaartenLitmaath	IN2P3-CC hardware failure on network equipment
pdf	SIR-IN2P3-CC-OperationsPortal-2010-04-22v2.pdf	r2 r1	manage	17.2 K	2010-05-07 - 11:14	HarryRenshall	SIR for IN2P3 Downtimes Notification Impossible
pdf	SIR-IN2P3-CC-PowerIncident-2011-04-08-v0.pdf	r1	manage	8.1 K	2011-04-14 - 11:29	MaartenLitmaath	IN2P3-CC power incident Apr 8
pdf	SIR-IN2P3-CC-PowerIncident-2011-08-26-v2.pdf	r1	manage	24.3 K	2011-09-14 - 20:50	MaartenLitmaath	IN2P3-CC cooling system failure Aug 26
pdf	SIR-IN2P3-CC-WNs-disconnected-2010-02-15-2.pdf	r1	manage	10.5 K	2010-02-25 - 14:28	HarryRenshall	Worker node network connectivity loss at IN2P3 15 Feb 2010
pdf	SIR-IN2P3-CC-dCache-2012-07-01-v1.pdf	r1	manage	6.7 K	2012-07-18 - 22:59	MaartenLitmaath	IN2P3-CC dCache downtime due to leap second
pdf	SIR-IN2P3-CC-lbms-DB-overload-2010-01-04.pdf	r1	manage	30.1 K	2010-01-11 - 16:08	DirkDuellmann	IN2P3 Local batch system database server overload
pdf	SIR-IN2P3-CC-network-2012-06-29-v0.pdf	r1	manage	5.7 K	2012-07-16 - 20:04	MaartenLitmaath	IN2P3-CC network outage
pdf	SIR-IN2P3-CC-network-2014-11-26-v0.pdf	r1	manage	31.6 K	2014-12-01 - 10:00	AndreaSciaba
pdf	SIR-IN2P3-CC-network-2015-11-03-v3.pdf	r1	manage	33.1 K	2015-11-12 - 14:18	AndreaSciaba
pdf	SIR-IN2P3-Dcache-ATLAS-Transfer-Degradation-2010-Q4-v3.pdf	r1	manage	281.6 K	2011-02-11 - 19:27	MaartenLitmaath	IN2P3-CC dCache transfer degradation for ATLAS
pdf	SIR20120921.pdf	r1	manage	31.9 K	2012-10-16 - 18:31	MaartenLitmaath	CNAF LHCb SE 6d downtime
pdf	SIR_201705.pdf	r1	manage	127.2 K	2017-06-06 - 12:11	MaartenLitmaath	GGUS outage of 2017-05-31
pdf	SIR_ASGC_July_2012.pdf	r1	manage	292.8 K	2012-11-21 - 18:42	JhenWeiHuang	SIR_ASGC_July_2012
pdf	SIR_BNL_CONDB.pdf	r1	manage	58.3 K	2011-09-29 - 15:12	MariaGirone
pdf	SIR_BNL_DB_CFG.pdf	r2 r1	manage	50.6 K	2011-09-20 - 10:01	MariaGirone
pdf	SIR_CCIN2P3_15aug2011.pdf	r1	manage	32.8 K	2011-08-22 - 17:12	JamieShiers
pdf	SIR_CCIN2P3_19july2011.pdf	r1	manage	37.0 K	2011-08-01 - 15:53	MaartenLitmaath	IN2P3-CC database incidents due to disk drive failures
doc	SIR_CCIN2P3_SRM_incident_08oct2009.doc	r1	manage	71.5 K	2009-10-12 - 14:22	JamieShiers
doc	SIR_CCIN2P3_cooling_outage_03nov2009.doc	r1	manage	12.5 K	2009-11-06 - 17:37	DirkDuellmann	IN2P3 cooling outage Nov 3rd
pdf	SIR_CNAF_20190829.pdf	r1	manage	49.9 K	2019-08-29 - 18:42	MaartenLitmaath	CNAF site outage Aug 6-21, 2019
pdf	SIR_COOLING_OUTAGE_2009_05_03.pdf	r1	manage	26.7 K	2009-05-22 - 14:05	HarryRenshall	SIR for IN2P3 cooling failure of 3 May 2009
pdf	SIR_FZK-LCG2_2010-01-13.pdf	r1	manage	28.5 K	2010-01-15 - 12:58	UnknownUser	SIR FZK-LCG2 (GridKa/KIT) - Information system problems on 13th and 14th of January 2010
pdf	SIR_GRID-FTP_OUTAGE_2009_06_11-1.pdf	r1	manage	73.9 K	2009-06-16 - 11:06	JamieShiers
pdf	SIR_PIC_ATLAS_T10KD_20160519.pdf	r1	manage	24.3 K	2016-05-19 - 10:05	AreshVedaee	T10KD issue at PIC affecting ATLAS
pdf	SIR_PIC_COOLING_OUTAGE_2009_04_14.pdf	r1	manage	32.0 K	2009-05-22 - 14:21	HarryRenshall	SIR for PIC cooling failure of 2009.05.14
pdf	SIR_PIC_COOLING_OUTAGE_2009_05_14.pdf	r1	manage	32.0 K	2009-05-22 - 14:26	HarryRenshall	SIR for PIC Cooling Outage of 14 May 2009
pdf	SIR_ROBOTIC_LIBRARY_OUTAGE_2009_04_22.pdf	r1	manage	22.8 K	2009-04-25 - 10:06	DirkDuellmann
pdf	SIR_ROBOTIC_LIBRARY_OUTAGE_2009_04_26-3.pdf	r1	manage	17.6 K	2009-04-30 - 11:50	JamieShiers
pdf	SIR_SARA_TAPEBACKEND_OUTAGE_2009_05_04.pdf	r1	manage	22.0 K	2009-05-07 - 15:27	HarryRenshall	SIR for SARA Tapebackend outage 4 to 6 May 2009
pdf	SIR_cooling_failure_20100710.pdf	r1	manage	53.4 K	2010-07-19 - 14:28	UnknownUser	SIR of the cooling incident at KIT on July 10
pdf	SIR_storage_FZK_GridKa.pdf	r1	manage	51.7 K	2009-07-02 - 14:17	JamieShiers
pdf	SIRondatalossinASGCinOct.2016.pdf	r1	manage	32.1 K	2016-11-11 - 14:21	MaartenLitmaath	ASGC - loss of ATLAS data, 18 Oct 2016
xlsb	SIRs-by-Q-2012.xlsb	r1	manage	43.8 K	2012-11-23 - 14:06	JamieShiers	Spreadsheet for producing SIR plots for WLCG QRs
pdf	SURFsara_SIR_network_outage_30-6-2016.pdf	r1	manage	57.0 K	2016-07-13 - 14:36	UnknownUser
pdf	SURFsara_Service_Incident_Report_-_bw32-1_backplane.pdf	r1	manage	4267.2 K	2015-02-09 - 16:58	AndreaSciaba
pdf	ServiceIncidentReport_20230329.pdf	r1	manage	147.5 K	2023-03-30 - 15:09	MaartenLitmaath	GGUS outage March 29-30, 2023
pdf	Service_Incident_Report.pdf	r1	manage	177.2 K	2014-01-14 - 12:09	SimoneCampana	Service instabilities in the SURFsara grid storage cluster
pdf	Service_Incident_Report_for_BNL_Tier1-06-2013.pdf	r1	manage	28.3 K	2013-06-26 - 21:56	MichaelErnst	Service Incident Report for US ATLAS Tier-1 Center
pdf	Storage_incident_report_at_TRIUMF_Sep-16-2013.pdf	r1	manage	46.6 K	2013-09-25 - 00:46	RedaTafirout	TRIUMF incident report (lost files)
pdf	TRIUMF-dcs08lun0_incident_20161218.pdf	r1	manage	41.7 K	2017-01-25 - 18:05	DiQing	ATLAS lost files at TRIUMF due to hardware/firmware issue on December 18 2016
pdf	TRIUMF-incident-report-april10-2012.pdf	r1	manage	29.8 K	2012-04-27 - 02:36	RedaTafirout	TRIUMF incident report
pdf	WLCG_AuthZ_Meeting_-_ATLAS_IAM_Outage_(31:10:2022)_-_CodiMD.pdf	r1	manage	354.6 K	2022-11-28 - 12:25	HannahShort
pdf	post-mortem-CNAF-CE-Problem-Sept-2016.pdf	r1	manage	141.2 K	2016-10-17 - 20:22	MaartenLitmaath
txt	power_cut_ASGC.txt	r1	manage	0.6 K	2009-07-31 - 16:19	GangQin	power cut at ASGC on July 17th
txt	power_surge_ASGC_20090118.txt	r1	manage	0.8 K	2010-02-01 - 12:59	GangQin	Po
pdf	sir-in2p3-cc-dcachesrmincident-2011-03-19-v2.pdf	r1	manage	7.1 K	2011-03-28 - 14:08	MaartenLitmaath	IN2P3-CC dCache SRM overload
pdf	sir-in2p3-cc-powerincident-2011-02-25-v0.pdf	r1	manage	7.8 K	2011-03-07 - 19:18	MaartenLitmaath	IN2P3-CC power incident Feb 25
pdf	sir-kit-atlas-dcache-20110728.pdf	r1	manage	25.9 K	2011-07-28 - 14:18	AndreasPetzold	SIR ATLAS dCache data loss at KIT July 2011
pdf	sir_BatchIncident_15_10_09.pdf	r1	manage	29.9 K	2009-10-15 - 16:07	JamieShiers
pdf	sir_in2p3network_outage_10_12_2009.pdf	r1	manage	48.8 K	2009-12-14 - 10:01	HarryRenshall	SIR of IN2P3 DNS Load Balancing Failure 8 December 2009
pdf	uscmsT1_SIR_042015.pdf	r2 r1	manage	46.7 K	2015-05-04 - 15:00	LucaMascetti	2015-05 FNAL uscms lost files

Topic revision: r302 - 2023-12-08 - AlastairDewhurst


Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.