ASGC Tier1 Oracle Problems 20081105

Site: Taiwan-LCG2
Incident Date: Oct. 25, 2008
Services Affected: Oracle Database Services
Impact: Data transmission for ATLAS and CMS
Incident Summary: Oracle system data was corrupted (undo and redo files), with a false acknowledgment during recovery. Castor and SRM services were down from Oct. 25 until Nov. 25.
Report Date: Dec. 1, 2008

Incident Details:

1. 25/10/08, Castor frontend throughput was observed decreasing to 0% and team tickets were raised (GGUS tickets 42913, 42878 and 42766).
   a. ORA error: Database error, Oracle code: 600 ORA-00600: internal error code, arguments: [ktspfredo-4], [0], [0], [], [], [], [], []
   b. Sent Service Request to Oracle: #7164142.993
2. 29/10/08, while the ORA error was being tracked, the shared memory processes were forcibly killed, which resulted in corrupted data files.
   a. Restarting the db services failed with CRS-1006: No more members to consider
      i. Registry integrity check succeeded at all instances
   b. Attempts to recover the control file failed with ORA-01110 for data files 82 and 83
   c. After recovering the db:
      i. alter database open with reset logs failed with ORA-01152: file 1 was not restored from a sufficiently old backup
3. 2/11/08, recovered the db via RMAN and dropped data file #25 offline because of the unrecoverable error. All instances and db services were able to start up normally after Nov. 2, 18:00.
4. 6/11/08, after one of the tablespaces listed in dba_rollback_segs was dropped (specific to one of the instances serving the stager service), one of the rollback segments showed status NEEDS RECOVERY. The undo log was recreated and everything was restarted. The stager service recovered after the statistics were redone.
5. 6/11/08 – 25/11/08
   a. Removed problematic procedures found from the error logs, and recompiled invalid procedures.
   b. Found many corrupted index blocks (File #10).
   c. Cleaned up and enlarged partitions for recovery, and removed obsolete archive logs.
   d. Recreated the REDO log files, with verification for the error found.
   e. Increased the session number to resolve the error "Oracle Code: 12520 ORA-12520: TNS: listener could not find available handler for requested type of server".
6. 25/11/08, all Castor, SRM, and Oracle services resumed, verified by experiment data transmission efficiency returning to almost 100%.
7. Follow-up actions (to be committed by end of 2008)
   a. File system migration from OCFS2 to ASM.
   b. Revise the recovery strategy and strengthen the service backup policy.
   c. A much better way is needed to improve communication and event reporting/interaction, for example through staff stationed at CERN.
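
Several steps above relied on reading error codes out of logs (ORA-00600, CRS-1006, ORA-01152, ORA-12520). As an illustration only, the following Python sketch tallies such codes from lines of text. It is not the procedure the site used: the function name tally_error_codes, the pattern, and the sample lines (copied from the messages quoted in this report, not from real log files) are all assumptions made for the example.

```python
import re
from collections import Counter

# Matches codes such as ORA-00600 or CRS-1006. The two prefixes are the ones
# quoted in this report; any other prefix would have to be added (assumption).
CODE_PATTERN = re.compile(r"\b(ORA|CRS)-(\d{4,5})\b")


def tally_error_codes(lines):
    """Count each distinct error code found in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for prefix, number in CODE_PATTERN.findall(line):
            counts[f"{prefix}-{number}"] += 1
    return counts


# Hypothetical input: messages quoted in this report, not real log output.
sample_lines = [
    "ORA-00600: internal error code, arguments: [ktspfredo-4], [0], [0], [], [], [], [], []",
    "CRS-1006: No more members to consider",
    "ORA-01152: file 1 was not restored from a sufficiently old backup",
    "ORA-12520: TNS: listener could not find available handler for requested type of server",
    "ORA-12520: TNS: listener could not find available handler for requested type of server",
]

counts = tally_error_codes(sample_lines)
for code, count in counts.most_common():
    print(code, count)
```

A tally like this only shows which codes recur; deciding which procedures or files are actually at fault still requires reading the surrounding log context.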