AFS operations meeting 2014-11-10
present: Massimo, Dan, Jan (Kuba: updating CERNBOX..)
Issues:
- afs270 got stuck "à la BOINC" last Monday/Tuesday. A single work volume was overloaded and throttled, and stayed throttled after the load dropped off; the user did some "fs sa" manipulations later. Initially only access to that volume was blocked; eventually "vos" commands against the machine got stuck as well. Core dumped, long salvage, general unhappiness (incl. GIT/SVN and CDS/Inspire sitting on "std" partitions).
- readonly-monitor also affected: it had corrupted state files (mails); fixed by deleting them.
- ABS: still/again seeing CASTOR NS unavailability. Might be thread exhaustion on the nameserver (plus one or more users doing "expensive" huge directory listings). The number of NSD threads was changed this morning.
Recent changes/ ongoing work:
- Dan: will try a "fileserver/vanilla" 1.6.10 fileserver for BOINC (have something similar in the "testcell" branch, but it is not merged and needs changes to work in "cern.ch")
- PTS cleanup: removed 28k host-based entries
- afs255/afs256 - still fixing the leftover corrupted volumes from power cut/SAS-tray change
- VLDB: has corrupted volume entries that prevent removing 2 old and dead fileservers; unclear how to fix without affecting random other volumes. Not affecting production.
  - seen when trying to get the list of our servers, which is needed for the "setserverprefs" script, which in turn is needed for having AFS servers at Wigner (without messing up Meyrin clients)
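
Sketch only, not the actual "setserverprefs" script: one way such a script could turn a server list into per-site preferences for `fs setserverprefs` (lower rank means preferred). The hostnames, site labels, rank values and helper names below are all made-up assumptions for illustration; the real server list would come from the VLDB.

```python
import shlex

# Hypothetical inventory (server name -> site); NOT the real server list.
SERVERS = {
    "afs-m1.example.org": "meyrin",
    "afs-m2.example.org": "meyrin",
    "afs-w1.example.org": "wigner",
}

# Assumed rank values; in AFS a lower rank means "more preferred".
LOCAL_RANK = 20000
REMOTE_RANK = 40000


def server_ranks(servers, client_site):
    """Give same-site servers the preferred rank, all others the remote rank."""
    return {
        name: (LOCAL_RANK if site == client_site else REMOTE_RANK)
        for name, site in sorted(servers.items())
    }


def setserverprefs_command(ranks):
    """Build the 'fs setserverprefs' command line for the given ranks."""
    args = ["fs", "setserverprefs", "-servers"]
    for name, rank in ranks.items():
        args += [name, str(rank)]
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # Command a Meyrin client would run: Meyrin servers ranked ahead of Wigner.
    print(setserverprefs_command(server_ranks(SERVERS, "meyrin")))
```

The sketch only prints the command, so the result can be reviewed before anything is applied on a client.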
Discussion:
- will have CDS/Indico/Inspire/xyz meeting to discuss using EOS (or CERNBOX or ...) instead of AFS; increased interest from them (incl EOSPUBLIC-for-Inspire ticket) after recent series of issues affecting them
- move to critical servers: waiting for afs230/afs231 (back from CERNBOX?)
- client crashes: should we go completely vanilla on the client? needs changes in (desktop) ncm-afsclt (other init scripts), but may want to drop the "big retry loop" patch in 1.6.10