INDICO/INSPIRE/CDS - AFS meeting

Europe/Zurich
31/S-023 (CERN)

INDICO/INSPIRE/CDS - AFS meeting 2014-10-24

present: Massimo, Nikos, Kuba, Dan, Pedro, Thomas, Jan

INDICO had two recent issues with AFS:

  1. Isolation: discovered that their project volumes were colocated with other projects (contrary to expectations - initial agreement seems to have been a dedicated server, but that was before several rounds of hardware replacements). Colocation was working OK until last week when the server was overloaded by an experiment.
  2. Critical power: their AFS volumes were unavailable during the power cut and did not come back immediately afterwards (~30min extra delay, manual action required; additional delays until the full network was back). INDICO servers are on critical power and would expect their dependencies to be similarly protected (having AFS servers inside the critical area would also have protected their network connection). The issue is made worse because AFS access to a non-responding server will hang the client in kernel space (this consumes threads on the webserver, which may run out of threads).
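
Illustration (added as a sketch, not discussed at the meeting): one way a frontend could avoid tying up worker threads on a hung AFS access is to probe the path from a short-lived child process and give up after a deadline. The function name path_responds and the timeout value are hypothetical; this assumes a hang would show up as a stat() call that never returns.

```python
import subprocess
import sys


def path_responds(path, timeout_s=5.0):
    """Return True if a stat() of `path` succeeds within `timeout_s` seconds.

    The stat() runs in a child process, so a hang in kernel space blocks
    the child rather than the calling (webserver) thread.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means the stat() completed without raising.
        return child.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in kernel space may ignore the
        # kill, so we deliberately do not wait for it here.
        child.kill()
        return False
```

A frontend could call this on a path inside the project volume before serving from it and return an error page instead of blocking. It does not remove the underlying dependency on the fileserver.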

Discussion:

Isolation (by having project-dedicated servers): might have been discussed initially (service manager changes..) but nowadays is more difficult with the relatively big diskservers in use (1 machine: 30TB, CDS+INDICO ~ 17TB); a second machine would be needed (so that volumes can be moved for interventions). Would prefer to try without. Issue with the per-volume throttling patch is still under investigation, hope to be able to fix.

Critical power: initially the discussion went around whether these projects can be split into critical and less-critical parts (e.g. content from previous years), in order to not have to provide the full content on critical power (expect strong growth over coming years). Agreement that the overall volume of "old" data does not warrant this for now. AFS may have 2 servers on critical power (besides user home directory servers) that can be used for this, but would reserve the right to place other similarly-critical projects there.

Other single points of failure even on critical power: neither CDS nor INDICO use AFS readonly replicas, so a single machine going down (hardware or software) will make all of these projects inaccessible. At least p.indico.root and p.cds.root should be readonly with some replicas.
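
Illustration (added as a sketch): in OpenAFS, readonly replicas are typically defined with "vos addsite" and populated with "vos release". The helper below only builds and prints the command lines, it does not run them. The server names and partition letters are hypothetical placeholders, and the actual procedure is the one in the AFS administrator guide.

```python
def replication_commands(volume, sites):
    """Return the vos command lines to replicate `volume` to `sites`.

    `sites` is a list of (server, partition) pairs; each gets one
    readonly site definition, followed by a single release that
    pushes the current read-write content to all defined sites.
    """
    commands = []
    for server, partition in sites:
        commands.append(
            ["vos", "addsite", "-server", server,
             "-partition", partition, "-id", volume]
        )
    commands.append(["vos", "release", "-id", volume])
    return commands


# Hypothetical servers and partitions, for illustration only.
example_sites = [("afs-server-a.example.org", "a"),
                 ("afs-server-b.example.org", "b")]
for cmd in replication_commands("p.indico.root", example_sites):
    print(" ".join(cmd))
```

Note that with replicated volumes, changes made to the read-write volume only become visible on the readonly path after the next release.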

Other points touched:

  • other use cases:
    • AFS has plenty of IP-based ACLs for INDICO: these are for the DB "backup" into AFS; should this use TSM directly?
      • little load, not crucial for the service
      • will most likely disappear with move to DB-on-demand
    • temp files, caching, logging (mentioned during the discussion?): use case requires a shared FS, but is low-frequency. Could be moved but is probably not worthwhile. Clients don't connect directly to the data; access always goes via frontend machines (7x INDICO, 5x CDS)
  • would DSS recommend any other storage service as a short-term migration target (CEPHFS, NFS, EOS)? Not now, and not for a production "critical" service (only the CEPH block store is at that level now); AFS is backed up.
  • INDICO would like to eventually use a different protocol, ideally an object store with direct HTTP accessibility via 30x redirects (aka S3)
    • ongoing discussion between IT-DSS and ZENODO, should join (so that IT-CIS projects go for a common technology)
    • will contact DSS again in 2015 once clearer.
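
Illustration (added as a sketch): the "30x redirect" idea means the frontend does not stream the file itself but answers with a redirect pointing at the object store, so the client fetches the data directly. The base URL, bucket name and function name below are hypothetical, and access control (e.g. signed URLs) is left out.

```python
from urllib.parse import quote

# Hypothetical object store endpoint and bucket, for illustration only.
OBJECT_STORE_BASE = "https://objectstore.example.org/indico-bucket"


def redirect_response(object_key, status=302):
    """Return (status, headers) redirecting the client to the object store.

    `object_key` is the key of the stored object; it is percent-encoded
    (slashes are kept) before being appended to the base URL.
    """
    if status not in (301, 302, 303, 307, 308):
        raise ValueError("status must be an HTTP redirect code")
    location = f"{OBJECT_STORE_BASE}/{quote(object_key)}"
    return status, {"Location": location}


print(redirect_response("event/1/slides.pdf"))
```

With this pattern the frontend machines only handle the small redirect response, while the bulk transfer goes straight between client and storage.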

Actions:

  • AFS service to identify a suitable AFS server on critical power, move INDICO + CDS there.
  • AFS service to provide docs on how to use readonly volumes for reliability (see http://information-technology.web.cern.ch/book/afs-administrator-guide, "advanced features/replication")
    • INDICO + CDS to see whether at least their top-level project directories can be made readonly+replicated.
