AFS weekly meeting

Europe/Zurich
31/1-012 (CERN)

Description
AFS service/operation meeting

AFS operations meeting 2014-11-10

present: Massimo, Dan, Jan (Kuba: updating CERNBOX..)

Issues:

  • afs270 got stuck "à la BOINC" last Monday/Tuesday. A single work volume was overloaded and throttled, and stayed throttled after the load dropped off; the user did some "fs sa" manipulations later. Initially this blocked access only to that volume; eventually "vos" commands against the machine got stuck too. Core dumped, long salvage, general unhappiness (incl. GIT/SVN and CDS/Inspire sitting on "std" partitions)
    • readonly-monitor also affected: had corrupted state files (mails), cleaned up by deleting them.
  • ABS - still/again seeing CASTOR NS unavailability. Might be thread exhaustion on the nameserver (+1 users doing "expensive" huge directory listing). Number of NSD threads got changed this morning.

Recent changes/ ongoing work:

  • Dan: will try the "fileserver/vanilla" 1.6.10 fileserver for BOINC (have something similar in the "testcell" branch, but not merged, and it needs changes to work in "cern.ch")
  • PTS cleanup: removed 28k host-based entries
  • afs255/afs256 - still fixing the leftover corrupted volumes from power cut/SAS-tray change
  • VLDB - have corrupted volume entries that prevent removing 2 old+dead fileservers; unclear how to fix without affecting random other volumes. Not affecting production.
    • seen when trying to get list of our servers, itself needed for "setserverprefs" script, needed for having AFS servers at Wigner (without messing up Meyrin clients)
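
Illustration for the "setserverprefs" item above: a minimal sketch of how per-site ranks could be generated once a clean server list is available. This is not the actual script. The hostnames, the site mapping, the rank values and the helper name build_setserverprefs are all hypothetical; the only thing taken from OpenAFS is that "fs setserverprefs -servers" accepts host/rank pairs and that a lower rank is preferred.

```python
# Hypothetical sketch: build an "fs setserverprefs" command that prefers
# fileservers at the client's own site. Hostnames, sites and rank values
# below are illustrative assumptions, not the production configuration.

def build_setserverprefs(servers, client_site, local_rank=20000, remote_rank=40000):
    """Return an argv list for `fs setserverprefs`; lower rank = preferred."""
    args = ["fs", "setserverprefs", "-servers"]
    # Sort by hostname so the generated command is deterministic.
    for host, site in sorted(servers.items()):
        rank = local_rank if site == client_site else remote_rank
        args += [host, str(rank)]
    return args


if __name__ == "__main__":
    # Made-up server list; in practice this would come from the VLDB.
    servers = {
        "afs-a.example.org": "meyrin",
        "afs-b.example.org": "wigner",
    }
    print(" ".join(build_setserverprefs(servers, "meyrin")))
```

The sketch only prints the command, so the resulting ranks can be reviewed before anything is applied on a client.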

Discussion:

  • will have CDS/Indico/Inspire/xyz meeting to discuss using EOS (or CERNBOX or ...) instead of AFS; increased interest from them (incl EOSPUBLIC-for-Inspire ticket) after recent series of issues affecting them
    • move to critical servers: waiting for afs230/afs231 (back from CERNBOX?)
  • client crashes: should we go completely vanilla on the client? needs changes in (desktop)  ncm-afsclt (other init scripts), but may want to drop the "big retry loop" patch in 1.6.10