US ATLAS Computing Facility

US/Eastern
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG 3.5.16 // 3.4.50 (this week)

      • HTCondor 8.8.9
      • XRootD 4.11.3 in OSG 3.4
      • Updated vo-client with EIC VO credentials

      Miscellaneous

      • OSG Security wants to include ATLAS in their May exercise but has had trouble getting a hold of Shigeki. Do we have the right email? misawa AT bnl.gov
      • The latest CA certificate release (osg-ca-certs 1.88, igtf-ca-certs 1.106) contains major updates to the InCommon CA chain to accommodate the expiring root CA. If there are any authentication issues, ensure that both the client and the server have updated to the latest versions.
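
      For what it's worth, a minimal client-side sketch for checking that a server's chain validates against the installed CA bundle (the host name and port below are placeholders, and /etc/grid-security/certificates is assumed to be the CA directory in use):

        import socket, ssl

        HOST = "gridftp.example.org"   # placeholder: replace with the server being debugged
        PORT = 443                     # placeholder port

        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)  # verifies certificate and hostname by default
        ctx.load_verify_locations(capath="/etc/grid-security/certificates")  # assumed CA directory
        with socket.create_connection((HOST, PORT), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
                # raises ssl.SSLCertVerificationError if the chain does not validate
                print("chain verified for", HOST)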
    • 13:20 13:35
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
      • 13:20
        TBD 15m
    • 13:35 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
      • normal operation overall
      • increased job slots for HIMEM jobs
      • issue with HIMEM jobs: they caused low utilization of CPU slots (only 10% of the site was used)
        • HTCondor schedules jobs based on CPU+RAM requirements.
        • BNL procures computing resources based on 2 GB RAM per job slot.
        • So HIMEM jobs (single core, 8 GB RAM) leave some CPU slots unusable (see the sketch below).
        • Discussion with ADC: after GU, ADC can set a limit on the number of HIMEM jobs running at a site.
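
      For illustration, a back-of-the-envelope sketch of the slot math (the 40-core node below is a hypothetical example, not the actual BNL configuration):

        # Illustrative only: a hypothetical 40-core node provisioned at 2 GB RAM per slot.
        CORES = 40
        RAM_GB = CORES * 2                      # 80 GB total

        def usable_cores(himem_jobs, himem_ram_gb=8):
            # Each single-core HIMEM job takes one core but the RAM of four 2 GB slots,
            # so HTCondor runs out of matchable RAM before it runs out of cores.
            ram_left = RAM_GB - himem_jobs * himem_ram_gb
            cores_left = CORES - himem_jobs
            return himem_jobs + min(cores_left, max(ram_left, 0) // 2)

        for n in (0, 5, 10):
            print(n, "HIMEM jobs ->", usable_cores(n), "of", CORES, "cores usable")
        # 0 -> 40, 5 -> 25, 10 -> 10: a modest number of HIMEM jobs can idle most of a node's slots.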
    • 13:40 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))

      It was a busy 2 weeks:

      • Smooth running generally, with a few hiccups. Some issues:
        • An AGLT2 IPv6 issue caused jobs to occasionally fail to reach the squid cache.
        • An older storage node at MWT2 is having real trouble.
        • NET2/OU got tickets for failing transfers with UK sites - not necessarily a US issue.
      • The number of open tickets is creeping up.
      • Spent some time trying to find a better monitoring plot to see if a site was seeing errors that other sites did not. Ofer will probably mention this.
      • Still working on stopping jobs from being retried up to 40 times.
      • COVID jobs continue to run well.
      • Lots of meetings about the Tier 2 pre-review review.
        • Deadline is May 22 but some answers will probably come later.
        • As Rob says, fill out your info in the Google sheet (including myself!)
      • Could we support a chat room or similar where the admins at all US sites could get quick answers from each other? Again, something for Ofer...
      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        Tickets:
          No current tickets,
          except for the SRR issue, which is globally on hold.

        Current investigation:
          (Thanks to Fred for noticing and digging into every oddity.)
          Investigating an occasional but significant rate of jobs failing a DNS/URL lookup for squid.aglt2.org.
          This problem seems to happen almost exclusively at AGLT2. Error code 65.
          warn  [frontier.c:1014]: Request 1278 on chan 7 failed at Tue May 12 09:05:42 2020: -6 [fn-urlparse.c:178]: host name squid.aglt2.org problem: Temporary failure in name resolution
          The Frontier python code calls getaddrinfo.
          Added IPv6 to the round-robin DNS for squid.aglt2.org to match IPv4.
          Added an access rule to our squids for our IPv6 address space.
          But transient errors are difficult to corner.  
          Ongoing. No clear answer yet.
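
          For reference, a minimal standalone check of the same lookup path (port 3128 and the retry count are arbitrary illustrative choices; the Frontier client itself does more than this):

            import socket, time

            HOST = "squid.aglt2.org"

            for attempt in range(60):
                try:
                    infos = socket.getaddrinfo(HOST, 3128, proto=socket.IPPROTO_TCP)
                    families = sorted({info[0].name for info in infos})  # e.g. ['AF_INET', 'AF_INET6']
                    print(f"attempt {attempt}: ok, families={families}")
                except socket.gaierror as err:
                    # EAI_AGAIN is reported as "Temporary failure in name resolution",
                    # matching the frontier error message above.
                    print(f"attempt {attempt}: FAILED: {err}")
                time.sleep(1)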

        Software
          - preparing to update to HTCondor 8.8.9 when it becomes available in the OSG release
          - preparing for renewal of all our SSL certificates

        Hardware:
          - reconfigured all smaller nodes with spare memory to a minimum of 2 GB per HT core

        BOINC
          - no longer running boinc on WNs with <= 2 GB/core, only on those with >= 2.6 GB/core
          - we also re-enabled boinc on only half of these larger nodes to allow comparison.
          - We also changed the boinc processes' initial OOM score (see the sketch after this list).
            1000 is the highest score we can give to boinc jobs.
            800 is assigned to condor jobs by the condor starter.
            The score evolves as oom_score = 10 x (percentage of memory used) + initial score.
            Thus a condor job would have to use 20% of all memory to pass a boinc job.
            This might be plausible on 8-core nodes but much less so on a 40-core node.
            Unless, of course, the job has a memory leak and thus should indeed be killed first.
          - Since implementing the steps above we have only seen 4 instances of OOM kills:
            3 were on nodes not running boinc, and
            all were (or would have been) legitimate OOM kills of misbehaving processes.
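
          A small sketch of the break-even arithmetic above (a simplification of the kernel heuristic, not its actual implementation):

            def effective_oom_score(initial_score, mem_percent):
                # Simplified model from above: 10 points per percent of total memory used,
                # plus the process's initial score.
                return 10 * mem_percent + initial_score

            BOINC_INITIAL = 1000    # highest score we can assign to boinc jobs
            CONDOR_INITIAL = 800    # assigned to condor jobs by the condor starter

            print(effective_oom_score(BOINC_INITIAL, 1))     # 1010: boinc using 1% of memory
            print(effective_oom_score(CONDOR_INITIAL, 20))   # 1000: condor break-even at 20%
            print(effective_oom_score(CONDOR_INITIAL, 25))   # 1050: likely a leaking job,
                                                             # which should indeed be killed first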

        covid
          - asked via OSG for an increased time limit for COVID jobs (10 h -> 36 h),
            as a large fraction of jobs was starting to fail to complete.

         

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        GGUS Ticket 146935:

        • Missing files (6 total) in dataset. We investigated and don't see them anywhere on our disk. Going to claim them as lost.

        Certificate issue (as Brian mentioned). Fixed with a gPlazma and certificate restart; at least jobs with file transfers are no longer completely failing.

        Asked to lower the number of COVID job pilots sent to our site. They were taking up a lot of capacity in place of ATLAS production.

        UC:

        Had two failing storage nodes; draining and retiring them. Currently working on getting the last bit of storage (~3 TB of data) offloaded from the remaining pool.

        Continuing to update storage nodes from el6 to el7.

        IU:

        Running fine. All clear.

        UIUC:

        A new hire, Nishant, was recently brought on. He is going through UIUC/ICC onboarding and will soon go through MWT2 onboarding.

        All clear.

      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))


        Smooth operations except for a first problem on the NESE side and a couple of minor issues. NESE has recovered and we are investigating.

        Still no Dell S5048-ON switches, which is delaying the NESE storage upgrade.

        Mostly through with the rolling Linux kernel update.

        Preparing materials for T2 review. 

        Low-level authentication issue understood as of this meeting; will update CAs to fix it, as Horst did.

         

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        - Not much to report, site full and running smoothly.

        - Had a few failed jobs because a SLURM reconfiguration inadvertently killed running jobs.

        - DDM transfers had issues to some UK sites, most likely related to obsolete CA files on both sides. Updated at OU. Ticket 146909 closed.

        UTA:

        - Fixed job accounting issue at SWT2_CPB (not sure if it's a permanent solution).

        - We were waiting on feedback from ADC ops regarding an issue with a failing SAM squid test (already fixed at UTA_SWT2). Ticket will be closed later today.

        - Both SWT2_CPB & UTA_SWT2 are generally running OK.

         

    • 14:00 14:05
      WBS 2.3.3 HPC Operations 5m
      Speaker: Doug Benjamin (Duke University (US))

      15 M events simulated at NERSC this past week.

      The NERSC ALCC allocation has been charged almost 21 Mhrs out of 36 Mhrs (41.6% remaining). Hopefully we can use up this allocation effectively this month.

      Will need to switch over to ALCC allocation as soon as possible.

      Lincoln is working on data handling testing between MWT2 and TACC.

       

    • 14:05 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)

        Normal operations

      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        ATLAS ML Platform & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        All working fine.

        Most users are doing perfSONAR analytics and some DNN work.

        Opportunistic use of the ML platform for COVID-19. Finished 54387 work units: now sixth in CERN, 4138th globally.

    • 14:20 14:40
      WBS 2.3.5 Continuous Operations
      Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))

      New test monit-grafana dashboard that arose from discussions with Johannes about the DAOD failure rates: it shows PanDA failure info for all US sites on one page, allowing comparison. See the links here or here. Thank you Johannes and Sasha Alekseev!

      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Analytics Infrastructure & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        All running fine.

        Adding more data sources (relating to panda, networking).

        Will need to dockerize logstash and move it to Kubernetes at UC.

      • 14:30
        Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
        Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Andrew Hanushevsky (Unknown), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

        VP service working fine.

        VP panda queues show same job and CPU efficiency as regular queues.

        AGLT2 upgraded server works better.

        MWT2 - added an NVMe drive. Now writing code to make better use of the NVMe performance.

        Prague - running fine.

        LRZ-LMU - works better. Limited by their spinning disk for metadata; will try moving it to a RAM disk.

        Some issues with XCache reaching 100% full under high load.

        Some dark data issues being investigated. 

    • 14:40 14:45
      AOB 5m