US HPC Facilities Meeting

US/Central
    • 11:00 11:15
      Harvester 15m
      Speaker: Tadashi Maeno (Brookhaven National Laboratory (US))
    • 11:15 11:30
      OLCF Update 15m
      Speakers: Danila Oleynik (Joint Institute for Nuclear Research (RU)), Sergey Panitkin (Brookhaven National Laboratory (US))
    • 11:30 11:50
      NERSC & ALCF 20m
      Speakers: Doug Benjamin (Duke University (US)), Taylor Childers (Argonne National Laboratory (US))

      Attendees: Kaushik, David Lesny, Sergey, Xin, Wei

      OLCF/Titan:

      Working on understanding the I/O and metadata I/O problems on Titan. We see O(10^3) metadata I/O during the first few minutes of a job.

      Mitigation: put multiple copies of the DB releases on RAM disk. The access time to the DB release SQLite files now ranges from 7 to 60 s, with up to 800 ranks.
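
      The replica idea above can be sketched as follows. This is only an illustration, not the code used on Titan: the function names, the "copyN" directory layout and the round-robin assignment of ranks to copies are all assumptions made for the example.

```python
import os
import shutil


def stage_replicas(source_file, n_copies, ramdisk_dir):
    """Copy source_file into n_copies sub-directories of ramdisk_dir.

    The "copy<i>" directory naming is a hypothetical layout.
    Returns the list of replica paths, indexed by copy number.
    """
    paths = []
    for i in range(n_copies):
        target_dir = os.path.join(ramdisk_dir, f"copy{i}")
        os.makedirs(target_dir, exist_ok=True)
        # shutil.copy returns the path of the newly created file
        paths.append(shutil.copy(source_file, target_dir))
    return paths


def replica_path(rank, n_copies, ramdisk_dir, filename):
    """Return the replica a given rank should open (round-robin, assumed)."""
    copy_index = rank % n_copies
    return os.path.join(ramdisk_dir, f"copy{copy_index}", filename)
```

      With N copies, each file is then shared by roughly (number of ranks)/N readers instead of all of them.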

      At Titan, the PanDA/pilot daemons run inside OLCF and reach out to BNL, etc. to fetch data, so authentication is only needed with the outside services.

      NERSC/Cori:

      Works well. Doug found a sweet spot at ~1000 KNL nodes (136 logical cores each). Beyond that, the job running time exploded from ~4h to >20h. Vakho's study showed that some of this is related to the "I/O finalization" of PoolSvc.

      We are running out of space at NERSC_DATA: less than 2.5TB is left (out of 30TB). There has been no data transfer out and no deletion over the last 36h. Contacted DDM to ask for deletion. No response yet.

      Wei tested running a job in a container with a CVMFS dump on the shared filesystem (bind mounted at /cvmfs in the container). This uses the latest production pilot. It works except for the very last step: the pilot was confused by a mismatch in the NERSC_DATA GridFTP endpoint, which has no port number. This will be fixed in AGIS when the current production campaign quiets down. We were not so lucky when trying to use the Burst Buffer. Investigating.
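
      The endpoint mismatch can be illustrated with a small sketch that compares two endpoint strings after filling in a missing port. This is not the pilot's code and not the planned AGIS fix; the function names and the example host are hypothetical, and the sketch assumes the missing port should default to 2811, the conventional GridFTP control port.

```python
from urllib.parse import urlsplit

GRIDFTP_DEFAULT_PORT = 2811  # conventional GridFTP control port (assumed default)


def normalize_endpoint(url, default_port=GRIDFTP_DEFAULT_PORT):
    """Return scheme://host:port, adding default_port when none is given."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else default_port
    return f"{parts.scheme}://{parts.hostname}:{port}"


def same_endpoint(a, b):
    """True if two endpoint URLs refer to the same scheme, host and port."""
    return normalize_endpoint(a) == normalize_endpoint(b)
```

      For example, same_endpoint("gsiftp://dtn.example.org", "gsiftp://dtn.example.org:2811/") is true under this assumption, whereas a plain string comparison of the two would fail.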

      BNL:

      Xin is exploring the possibility of using the BNL HPC (KNL). The batch nodes won't have outbound TCP. The interactive nodes have only ports 22 (ssh), 80 (http) and 443 (https) open. Working with Sergey to import the pilot launch mode from Titan.
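
      A quick way to confirm which ports are usable from a given node is a plain TCP connect test. The sketch below is a generic probe, not part of the pilot; the function name is hypothetical and the target host has to be supplied by the user.

```python
import socket


def reachable_ports(host, ports, timeout=3.0):
    """Return the subset of ports on host that accept a TCP connection."""
    reachable = []
    for port in ports:
        try:
            # create_connection raises OSError on refusal or timeout
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append(port)
        except OSError:
            pass
    return reachable
```

      Calling reachable_ports(host, [22, 80, 443]) from an interactive node, with host set to an external service, would show which of the three ports actually get through.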