AFS ops meeting 2014-11-03
present: Dan, Kuba, Massimo, Jan
Issues:
- BOINC server hangs: was OK over weekend, just started hanging again several times (and is now auto-SEGV-ed by Dan's cron job). Suspicion is that the CERN rescheduling patch (now?) has some concurrency issue, possibly linked to "fs setacl" (which BOINC/sixtrack did repeatedly; removed from one script, but many others had copied that). Dan will try out a minimally-modified upstream 1.6.10.
- Backup/ABS:
  - will better handle corrupted/0-size CASTOR files, in particular never consider these to be useful for incremental backups (and eventually also auto-clean these)
  - Massimo: TSM usage decreasing as expected? Should go from initial 2.5PB to 500TB (last full copy of all AFS; will stay until manually removed, once restores from TSM have been disabled). Kuba will check.
- Indico meeting: no "spare" server on critical power? (only afs260-267 are in barn, all assigned to volsets "users idle_projects"; afs262 has fewer users (why?)).
  - could try to get ex-afs240,afs241 back (but already renamed..)
  - could wait for Kuba to liberate afs230,afs231 (cernbox) - ETA 2 weeks (OK)
  - side discussion - both "critical power" and "UPS" ought to be easily queryable (aka become a puppet "fact")
- PTS IP/DNS cleanup - done, 27k entries gone (have list); will run once/day
  - should also remove the "parked" IPs that can be recycled at short notice from OpenStack; these have some fairly random name.
- afsmisc - move "arcserver" functionality to VM tomorrow morning; can use "afs_admin -s afsarc1.cern.ch" to test
- ASIS removal - why touch this?? some volumes broken; hardcoded; "odd" use case (e.g. external IPs). Removal is a great way to make friends.. Should double-check that AFS is not using anything (perl?) from there..
Discussion:
Is it worth running AFS servers without AFS client access? Would need to package the existing code. (several benefits: would allow to attack the mess of dependencies under /p/, can do staged roll-out of code, can use in test cell). Unclear:
- some AFS client access might be required for some of the functionality (mount restore volumes?)
- /p/cernafs/ was an attempt to migrate existing code into some new self-contained structure, seems not to be progressing much (could use git checkouts to get code onto machines, instead of RPMs)
- Alternative is to review existing functionality, rewrite according to actual existing use (possibly in a different context, aka "test cell" - can use non-admin people for that). Once feature parity is reached, no longer need to look at original code. Of course, new bugs will come in as well.
Discussion moved on to whether "arc" can be replaced - use SSH or REST for remote calls. But the major mess is in the procedure on the server side, not the actual ARC server. Agreement that the "chained" ARC procedure (hop from one user to another, via ARC calls in some perl module) should die.
[no real conclusion; can stay as low-prio target; simplification is obviously good]