US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2020-04-29T13:00:00-04:00
End: 2020-04-29T14:45:00-04:00
Location: No location set

Wednesday 29 Apr 2020, 13:00 → 14:45 US/Eastern

- 13:00 → 13:10
  WBS 2.3 Facility Management News 10m
  
  Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
  US ATLAS Computing Facility Capacity Spreadsheet: https://bit.ly/usatlas-capacity
  
  Through March 2020 (FY20Q2):
  
  V52: CPU capacity increments & retirements
  
  WLCG-v52: Pledge figures from REBUS available (fill in as needed)
  
  WLCG-v52, Table 1: Installed storage capacity
  
  WLCG-v52, Table 2: FY20 Procurement plans
  
  WLCG-v52, Table 3: Retirements
  
  WLCG-v52, Table 4: AUX equipment (non-CPU, non-disk)
- 13:10 → 13:20
  OSG-LHC 10m
  
  Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
  3.5.16 and 3.4.50
  
  Frontier Squid
  
  XRootD 4.11.3-1.2 for 3.4 (already released in 3.5), including a fix for a core dump seen at OU
  
  HTCondor 8.8.9 and 8.9.7
  
  Other
  
  We've built the osg-wn-client and relevant packages for EL8!
  
  XRootD 5 RC and plugins have successfully passed internal tests
- 13:20 → 13:40
  Topical Reports
  
  Convener: Robert William Gardner Jr (University of Chicago (US))
  - 13:20
    
    OSG-LHC Technical Roadmap 20m
    
    Speaker: Brian Paul Bockelman (University of Nebraska Lincoln (US))
    
    USATLAS-Facilities-2020-Upgrades.pdf
- 13:35 → 13:40
  WBS 2.3.1 Tier1 Center 5m
  
  Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
  Working fine
  
  dCache upgrade scheduled in 3 weeks
  
  Intel CPU delivered end of June
- 13:40 → 14:00
  WBS 2.3.2 Tier2 Centers
  
  Updates on US Tier-2 centers
  
  Convener: Fred Luehring (Indiana University (US))
  For all sites that see a small percentage of jobs fail with timeouts on input/output:
  
  We are investigating the interaction between the Rucio mover, gfal2, and XRootD. In a number of cases the actual transfer was not even attempted, and the reason seems to be the way the Rucio mover tries to stat the file and get its checksum. Hopefully a fix will come soon; once it is ready, we will try to get it tested and deployed quickly. This does not exclude the possibility that other issues are lurking there.
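To make the failure mode above easier to reason about, here is a minimal sketch of a transfer split into stat, checksum, and copy stages, recording which stage failed and whether the copy was ever attempted. This is not the Rucio mover's code; `staged_transfer`, `TransferResult`, and the three callables are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

STAGES = ("stat", "checksum", "copy")


@dataclass
class TransferResult:
    failed_stage: Optional[str]  # None if every stage succeeded
    error: Optional[str]         # message from the failing stage, if any
    copy_attempted: bool         # True only if the copy stage was reached


def staged_transfer(stat: Callable[[], None],
                    checksum: Callable[[], None],
                    copy: Callable[[], None]) -> TransferResult:
    """Run the three stages in order and stop at the first failure.

    Each argument is a hypothetical callable standing in for the real
    operation (for example a storage stat, a checksum query, a file copy).
    """
    steps = dict(zip(STAGES, (stat, checksum, copy)))
    for name in STAGES:
        try:
            steps[name]()
        except Exception as exc:
            # A failure in "stat" or "checksum" means the copy never started.
            return TransferResult(failed_stage=name, error=str(exc),
                                  copy_attempted=(name == "copy"))
    return TransferResult(failed_stage=None, error=None, copy_attempted=True)
```

Tagging failures this way would separate "timed out before the transfer started" from "timed out during the transfer", which is the distinction the note above draws.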
  
  Fred:
  
  It was an OK week for production.
  
  A number of tasks had high failure rates, but the cause was on the submission side.
  
  Most recently, in the last day, looping event generation jobs were killed as a group.
  
  I was going to mention the Rucio transfer issue but Ilija beat me to it by providing the notes above.
  
  There was also an unintended Rucio release, which caused trouble for about 1 day.
  
  Several sites had short-term issues.
  
  Covid jobs seemed to run OK but of course reduced ATLAS production.
  
  NET2 had some stage-out issues with the covid jobs.
  
  Looks like recovering just over a month (Feb 28 to Apr 8) of accounting data for CPB will be hard. Right now CPB is not reporting anything to the official GRACC/APEL system for the entire month of March.
  
  Port scanning from LHCONE?
  - 13:40
    
    AGLT2 5m
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
    
    incidents:
    
    21 April: one of our new R740x2d dCache servers died; the daughterboard was burnt. We got it replaced within 48 hours, with Dell sending an onsite technician. Before that, we submitted a JIRA ticket to declare the files unavailable.
    
    Services:
    
    We still see jobs getting killed due to OOM, 200 jobs per 2 weeks. This mostly happens on worker nodes with less than 2 GB/core. We are in the process of 1) adding more memory to worker nodes using retired parts and 2) disabling HT on worker nodes without spare DIMM parts.
    
    We see that 60% of the cluster is being used by analysis jobs. This might be caused by our recent reconfiguration of Condor and the gatekeeper, done to balance giving enough cores to COVID-19 jobs against having less fragmentation of Condor cores. Too many analysis jobs seem to increase the job failure rate at the site.
    
    Condor has been updated to 8.8.8.
    
    Hardware:
    
    Retired 20TB usable space from dCache to get spare parts to cover the storage enclosures not under warranty anymore.
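For the OOM note under Services above, a small sketch of why disabling HT helps on nodes below 2 GB/core. It assumes hyper-threading exposes two logical cores per physical core and that one job slot is scheduled per logical core. The function name and the node sizes in the usage note are hypothetical, not AGLT2's actual hardware.

```python
def memory_per_slot_gb(total_memory_gb: float,
                       physical_cores: int,
                       hyperthreading: bool) -> float:
    """Return the memory available per job slot in GB.

    Assumption: with hyper-threading on, each physical core provides two
    logical cores, and the batch system runs one slot per logical core.
    """
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    slots = physical_cores * 2 if hyperthreading else physical_cores
    return total_memory_gb / slots
```

For a hypothetical node with 64 GB and 20 physical cores, `memory_per_slot_gb(64, 20, True)` is below the 2 GB/core threshold, while `memory_per_slot_gb(64, 20, False)` is above it, at the cost of half the job slots.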
  - 13:45
    MWT2 5m
    
    Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
    
    Fixing storage issues at UC. Two of our older out-of-warranty servers have been having controller issues. Currently draining the pools that are still online and trying to recover data from the pools that are failing.
    
    The root disk on the UC gatekeeper filled up, causing job failures this morning
    
    NVIDIA drivers updated on the ML platform
    
    LOCALGROUPDISK filled up last Friday. Cleanup ongoing, now down to 97% full
  - 13:50
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
  - 13:55
    SWT2 5m
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    UTA_SWT2:
    
    GGUS-Ticket-ID: #146691 Concerns SAM test for Frontier setup. Only the test is affected, jobs are fine. Test probably needs to be updated.
    
    Ramping up OSG Covid-19 jobs
    
    SWT2_CPB:
    
    GGUS-Ticket-ID: #146694 Same issue as seen above.
    
    GGUS-Ticket-ID: #146387 now closed.
    
    Met with networking staff for IPv6 discussions. They are evaluating options before committing to a timeline.
    
    OU:
    
    - Not much, all running well
    
    - Upgraded xrootd to 4.11.3, which fixed space reporting and logging, and were able to delete some old data from OU_OSCER_ATLAS_LOCALGROUPDISK
- 14:00 → 14:05
  
  WBS 2.3.3 HPC Operations 5m
  
  Speaker: Doug Benjamin (Duke University (US))
  
  Issues with credential expiration delayed processing at NERSC, but NERSC is running again. Did ramp up to over 2K jobs.
  
  Work continues on integrating TACC. Lincoln and Doug will work together tomorrow.
- 14:05 → 14:20
  WBS 2.3.4 Analysis Facilities
  
  Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:05
    
    Analysis Facilities - BNL 5m
    
    Speaker: William Strecker-Kellogg (Brookhaven National Lab)
  - 14:10
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:15
    
    ATLAS ML Platform & User Support 5m
    
    Speaker: Ilija Vukotic (University of Chicago (US))
    
    Running smoothly.
    
    Opportunistic Folding@home jobs got us to sixth place:
    
    https://stats.foldingathome.org/team/38188
- 14:20 → 14:40
  WBS 2.3.5 Continuous Operations
  
  Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))
  
  Ofer, Fred, Johannes will meet on Friday to follow up on Fred's report from last week and discuss monitoring & procedures for the US cloud.
  - 14:20
    
    US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
    
    Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
    
    US-cloud-summary-4_22_20.pdf
    
    US-cloud-summary-4_29_20.pdf
  - 14:25
    
    Analytics Infrastructure & User Support 5m
    
    Speaker: Ilija Vukotic (University of Chicago (US))
    
    ES running smoothly.
    
    Changes in collection at CERN. Added Jedi task parameters data source. Very complex but gives possibilities we did not have before. Ivan and Mayuko are working on it. Now investigating site exclusion by users.
  - 14:30
    
    Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
    
    Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    
    Slowly ramping up with XCaches and VP.
    
    AGLT2 - replaced their node with a new one with more storage. Changed them to direct access.
    
    Prague - running smoothly. Will upgrade further this or next week.
    
    LRZ - issue with the cleanup; disk usage managed to cross the high-water mark (HWM).
    
    ROOT TChain bug discovered and fixed. Waiting for the LCG build to get it in production.
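As background for the LRZ note above, a generic sketch of watermark-based cache cleanup: once usage exceeds a high-water mark, the least recently accessed files are evicted until usage falls to a low-water mark. This shows the general technique only and is not XCache's implementation; the function name, the tuple layout, and the threshold values in the usage note are assumptions.

```python
from typing import List, Tuple

# Each cached file is (name, size_bytes, last_access_timestamp).
CachedFile = Tuple[str, int, float]


def files_to_evict(files: List[CachedFile],
                   capacity_bytes: int,
                   high_wm: float,
                   low_wm: float) -> List[str]:
    """Return the names to evict, oldest access first.

    Nothing is evicted unless usage exceeds high_wm * capacity_bytes.
    Once triggered, eviction continues until usage <= low_wm * capacity_bytes.
    """
    if not 0.0 <= low_wm <= high_wm <= 1.0:
        raise ValueError("expected 0 <= low_wm <= high_wm <= 1")
    used = sum(size for _, size, _ in files)
    if used <= high_wm * capacity_bytes:
        return []
    target = low_wm * capacity_bytes
    evicted = []
    for name, size, _ in sorted(files, key=lambda f: f[2]):
        if used <= target:
            break
        evicted.append(name)
        used -= size
    return evicted
```

If the cleanup pass stalls or runs too rarely, usage can climb past the high-water mark before eviction catches up, which is the kind of symptom described for LRZ. For example, with hypothetical thresholds `high_wm=0.9` and `low_wm=0.5`, a cache at 95% usage evicts its oldest files until it is at or below 50%.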
- 14:40 → 14:45
  
  AOB 5m
