EOS DevOps Meeting

Name: EOS DevOps Meeting
Start: 2017-08-22T16:00:00+02:00
End: 2017-08-22T17:45:00+02:00
Location: CERN

Tuesday 22 Aug 2017, 16:00 → 17:45 Europe/Zurich

513/R-068 (CERN)

Jan Iven (CERN)

Description

Weekly meeting to discuss progress on EOS rollout

● production instances

ALICE updated to 0.3.265 (balancing bug; 0.3.266 would be latest), balancing still off until files have been recovered. Have script to recover for CMS (will keep apart for T0, propose to sneak back in place for group+user files).

Will need to run same exercise for EOSATLAS, EOSALICE.

Have no logs for EOSATLAS on the headnode (might need to retrieve them from CASTOR; the MGM machines changed in March, and the logs were perhaps manually "cleaned up" after a /var-full alarm).

Investigating a file loss on ZENODO.

Online compact tool has a bug that can be triggered on the 2nd/3rd compact under writes; this crashed EOSUSER. Fixed but not yet tagged.

Still have the compact "order dependency" that will "hide" files on the slave. There is no longer a real reason to compact only files and not directories, so could simply disallow selective compacts. Or add some safeguard (compaction timestamp)? The whole business will be gone with the new namespace. Decide: only update docs.

● CERNBOX and EOSUSER

Investigating an issue for the "webcast" user on EOSUSER (stat() order? accounting bug? heavy writers, 13TB/day).

● FUSE and client versions

4.1.26 was tagged to production today.

● Citrine rollout

Need to update EOSPUBLIC to 4.1.26 (yum createrepo bug).

EOSLHCB (MGM) is already on 4.1.26.

● SWAN

one "eosd" crash (on old version). Have one test node on 4.1.25 but service prefers to keep a stable version running for as long as possible.

"Unified" kerberos principal deployed on SWAN, have go-ahead for MGM (prefer EOSPUBLIC to go first)

● nextgen FUSE

"more positive":

found major bug, one race condition is left.
Have implemented active notification that client is still using a directory (once a client has subscribed to changes), will keep extending for 5min-windows.
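
The 5-minute extension described above can be pictured as a renewable lease. The following is a minimal sketch only, not the nextgen FUSE implementation: the class name `DirectoryLeases`, the `heartbeat` method and the injected clock are all hypothetical and chosen for illustration.

```python
import time

LEASE_WINDOW_S = 5 * 60  # 5-minute window, as mentioned in the notes


class DirectoryLeases:
    """Hypothetical tracker of which clients still use which directories."""

    def __init__(self, window_s=LEASE_WINDOW_S, clock=time.monotonic):
        self._window_s = window_s
        self._clock = clock
        self._expiry = {}  # (client_id, directory) -> expiry time

    def heartbeat(self, client_id, directory):
        """Client reports it still uses the directory: extend by one window."""
        self._expiry[(client_id, directory)] = self._clock() + self._window_s

    def subscribers(self, directory):
        """Return the clients whose lease on the directory has not expired."""
        now = self._clock()
        # Forget leases that were not renewed within their window.
        self._expiry = {k: t for k, t in self._expiry.items() if t > now}
        return sorted(c for (c, d) in self._expiry if d == directory)
```

A client that stops sending heartbeats simply drops out of `subscribers()` after one window, so the server would not need an explicit unsubscribe message in this model.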

Discussion - how to merge this? It has hooks all over the place (on commit, to notify), in theory unused but of course could have a bug. Suggest to merge all outstanding "fixes", tag, then merge the new (server-side) FUSE interface.

Also debugging a memory leak (not found), but instead saw a 10sec-delay from Xrootd code that needs fixing (global lock on close(), delays subsequent open() - also for read; affects checksumming, protects FileHandleTable). Code is on FST but affects FUSE, might ask for patched 3.3.6 (4.X does something differently).
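
To illustrate why a table-wide lock held across a slow close() delays every later open(), here is a toy sketch of the usual remedy: hold the lock only while touching the table and do the slow work outside it. This is not the Xrootd code and not necessarily the fix that will be applied; the class below merely borrows the name FileHandleTable from the notes, and `slow_finalize` is a hypothetical stand-in for work such as checksumming.

```python
import threading


class FileHandleTable:
    """Toy handle table: the lock guards only the dict, not the slow close work."""

    def __init__(self):
        self._lock = threading.Lock()
        self._handles = {}
        self._next_fd = 0

    def open(self, path):
        with self._lock:  # short critical section
            fd = self._next_fd
            self._next_fd += 1
            self._handles[fd] = path
            return fd

    def close(self, fd, slow_finalize):
        with self._lock:  # only remove the entry under the lock
            path = self._handles.pop(fd)
        # Slow work runs without the table lock, so concurrent open()
        # calls are not queued behind it.
        slow_finalize(path)
```

If `slow_finalize` were called inside the `with` block instead, any open() issued during that time would wait for the full duration of the close, which matches the symptom described.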

Also, does logrotating on the MGM cause a similar lock to be held (copytruncate, needs to copy the full file)?

Plain xrootd logrotate had no performance issue (just closes+reopens), but 3.X still has the summertime bug. 4.X has a new internal logrotate but it is currently not used - decide: try this out on EOSPPS.

● new Namespace

512GB+SSD machines: converts (full) EOSALICE namespace in 3.5h

Will store only UID in ACLs. Discussion: who should translate? ("eos" could ask the server to translate, but FUSE can't). Plan to do this at conversion time? Or before (in which case would need to have the client-conversion). Could pre

Working on 'atomic' ACL change.
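
For the UID-only ACL storage discussed above, translation at conversion time could look roughly like the sketch below. The ACL string layout (`u:<name>:<perms>` entries separated by commas) and the `name_to_uid` mapping are assumptions made for illustration; the real converter may work on a different representation.

```python
def acl_names_to_uids(acl, name_to_uid):
    """Rewrite 'u:<name>:<perms>' entries to 'u:<uid>:<perms>'.

    Entries that are already numeric, or that are not user entries, pass
    through unchanged. Unknown user names raise KeyError so that the
    conversion cannot silently drop or mangle them.
    """
    if not acl:
        return acl
    out = []
    for entry in acl.split(","):
        kind, subject, perms = entry.split(":", 2)
        if kind == "u" and not subject.isdigit():
            subject = str(name_to_uid[subject])
        out.append(f"{kind}:{subject}:{perms}")
    return ",".join(out)
```

Doing this once on the server side would avoid the problem raised in the discussion that FUSE clients cannot ask the server to translate names.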

Q: has Giorgios merged the write speedup? No. Once back (2 weeks), OK to go to EOSBACKUP (at least for conversion test).

● Backup

will update EOSBACKUP once 0.3.267 is tagged.

will add the missing TPC flag.

● Samba

Latest SAMBA update breaks with EOS. Under investigation.

● Xrootd

4.7 release candidate still under test by CMS, expect to fully release Monday. Not tested at CERN by Dan since build was broken. One crash seen, but seems to be EOS-side.

Q: has the GSI perf fix been included? Unclear, will check.

Kuba had 3 things (with 4.6).

One config issue (missing "-f" for TPC - but EOS-to-EOS should actually use "eoscp"?)
Missing source file on TPC - failed with "redirect limit has been reached"; is the real error printed in the dump?
Unable to mkdir() - parallel transfers might create directories in parallel (client sets flag, server fails the open()) - looks like a server-side issue (EOS -> JIRA)
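
The mkdir() race in the last item has a well-known local-filesystem analogue, sketched below: several writers create the same parent directory at once, and the creation must treat "already exists" as success. This only illustrates the pattern; it does not use the Xrootd or EOS APIs, and the function names are hypothetical.

```python
import os
import threading


def ensure_parent_dir(path):
    """Create the parent directory of `path`, tolerating concurrent creators."""
    # exist_ok=True turns "someone else created it first" into a no-op
    # instead of an error that would fail the subsequent open().
    os.makedirs(os.path.dirname(path), exist_ok=True)


def create_in_parallel(paths):
    """Run ensure_parent_dir for every path in its own thread; return errors."""
    errors = []

    def worker(p):
        try:
            ensure_parent_dir(p)
        except OSError as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker, args=(p,)) for p in paths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors
```

A server that instead fails the second creator with "file exists" would produce the kind of open() failure reported for parallel transfers.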

- 16:00 → 16:05
  
  overall 2017 planning 5m
  
  Speaker: Jan Iven (CERN)
- 16:05 → 16:30
  operations: production
  - 16:05
    
    production instances 5m
    
    Speaker: Herve Rousseau (CERN)
  - 16:10
    
    CERNBOX and EOSUSER 5m
    
    Speaker: Luca Mascetti (CERN)
  - 16:15
    
    FUSE and client versions 5m
    
    Speaker: Dan van der Ster (CERN)
  - 16:20
    
    Citrine rollout 5m
    
    Speaker: Herve Rousseau (CERN)
  - 16:25
    
    SWAN 5m
    
    Speaker: Jakub Moscicki (CERN)
- 16:30 → 16:50
  development: near-term
  - 16:30
    nextgen FUSE 5m
    
    Speaker: Andreas Joachim Peters (CERN)
  - 16:35
    
    new Namespace 5m
    
    Speaker: Elvin Alin Sindrilaru (CERN)
- 16:50 → 17:45
  other: pilot services, long-term dev, external
  - 16:50
    
    Webservice 5m
    
    Speaker: Luca Mascetti (CERN)
  - 16:55
    
    Backup 5m
    
    Speaker: Luca Mascetti (CERN)
  - 17:00
    
    Samba 5m
    
    Speaker: Luca Mascetti (CERN)
  - 17:05
    
    $HOME structure 5m
    
    Speaker: Luca Mascetti (CERN)
  - 17:10
    
    BATCH integration 5m
    
    Speaker: Massimo Lamanna (CERN)
  - 17:15
    Xrootd 5m
    
    Speaker: Michal Kamil Simon (CERN)
  - 17:20
    
    AOB 5m
