AFS operation meeting 2014-09-29
present: Kuba, Massimo, Jan
Issues:
- ABS: CASTOR nameserver issues over the weekend ("host not found"), under investigation with Xavi. Looks like a dodgy node in the CASTORNS DNS alias; the problem was not seen while only 1 nameserver was used.
- ABS: 1 corrupted (0-size) dump on Saturday. Have FileID, will look at logviewer.
- afs500/501: get "abrt" mails from crashing (python) hwcollect script. Already escalated to Eric, linked to the missing SAS tray. Turn off "abrt" for these scripts?
- SURE: embarrassing that this is still running (might not even work anymore? Operators haven't received anything recently). Massimo: turn off, investigate a proper replacement later.
- Pedro invited (only some people?) to a monitoring meeting, now apparently scheduled for next Tue 14:00.
- AFS DB update: rescheduled to Monday 2014-10-06 07:30 (Jan).
  - Steps: turn off writers (AIM, cronjobs, afs_adin "pv" (notify power users?)). Scripted update, roll back to 1.4 in case of failures.
  - Suggestions:
    - check for backup of DB files (done: cronjob on AFSDB3, stores into AFS).
    - check remaining VMs for other cron jobs.
- (forgotten but on Jan's list: AFS client stats are broken on 1.6).