US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2020-07-22T13:00:00-04:00
End: 2020-07-22T14:45:00-04:00
Location: No location set

Wednesday 22 Jul 2020, 13:00 → 14:45 US/Eastern

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
- 13:10 → 13:20
  OSG-LHC 10m
  
  Minutes
  
  Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
  OSG 3.5.20
  
  HTCondor-CE 4.4.0
  
  Frontier Squid 4.12-2
  
  CVMFS 2.7.3
  
  scitokens-cpp 0.5.1
  
  Other
  
  ATLAS XCache containers based on XRootD 5 are available for testing (tagged as 'upcoming-fresh' and 'upcoming-timestamp')
  
  OSG All Hands virtual meeting Aug 31 - Sep 3 (https://opensciencegrid.org/all-hands/2020/)
- 13:20 → 13:35
  Topical Reports
  
  Convener: Robert William Gardner Jr (University of Chicago (US))
  - 13:20
    
    Kubernetes Activities at SWT2 15m
    
    Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    
    100722_Kubernetes_UTA.pdf
- 13:35 → 13:40
  WBS 2.3.1 Tier1 Center 5m
  
  Minutes
  
  Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
  Normal operation in general.
  
  RPVLL reprocessing started last Thursday. The BNL staging rate, as shown on the DDM dashboard, is 2~3 GB/s. A temporary glitch between HPSS and dCache affected retrieval of staged files from the HPSS disk cache to dCache DATATAPE for a couple of hours. It was fixed by restarting HPSS batch; the root cause is under investigation.
  
  40 new WNs added to the T1 farm
  
  Supermicro SYS-6019U-TR4 servers. 2 x Xeon Cascade Lake 6252 CPUs (96 logical cores total). 12 x 16 GB (192 GB total) DDR4-2933 MHz DIMMs. 4 x 2 TB SSDs. 2 x 1 Gbps LACP link. 1141 HS06 per node.
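
The node specification above supports a quick capacity estimate. The following is a minimal sketch in Python that uses only the figures quoted in these minutes (40 nodes, 1141 HS06 per node, 96 logical cores and 192 GB of RAM per node); the variable names are illustrative and not taken from any site tool.

```python
# Figures quoted in the Tier1 notes above.
new_nodes = 40
hs06_per_node = 1141
logical_cores_per_node = 96
ram_gb_per_node = 12 * 16  # 12 DIMMs of 16 GB each

# Derived quantities.
total_hs06_added = new_nodes * hs06_per_node
hs06_per_logical_core = hs06_per_node / logical_cores_per_node
ram_gb_per_logical_core = ram_gb_per_node / logical_cores_per_node

print(f"HS06 added to the farm: {total_hs06_added}")
print(f"HS06 per logical core:  {hs06_per_logical_core:.2f}")
print(f"RAM per logical core:   {ram_gb_per_logical_core:.1f} GB")
```

The per-core figure is simply the per-node score divided by the logical core count, so it is directly comparable with other per-HT-core numbers only if they were derived the same way.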
- 13:40 → 14:00
  WBS 2.3.2 Tier2 Centers
  
  Minutes
  
  Updates on US Tier-2 centers
  
  Convener: Fred Luehring (Indiana University (US))
  Tier 2 Notes:
  
  Please complete your additional funds requests and site reporting before COB today. (SWT2 is missing in https://docs.google.com/document/d/170cSG0BVKzntxVaa0gGrj1TClmwd4FQAUQMAxFEM2Hw/edit# )
  
  Problems over the last 2 weeks:
  
  AGLT2: dCache issues - working to find a good version. Transfer success rate ~40%.
  
  The 40% success rate was apparently a monitoring issue with the new Site Oriented Dashboard. Today the plot shows 90%-100% over the last week and looks much different than it did yesterday.
  
  MWT2: Transfer troubles, with the largest issue traced to an IPv6 problem at UIUC. Transfer issues remain at a low level after the IPv6 problem was fixed. (All sites have a low rate of transfer issues.)
  
  NET2: Ongoing transfer issues.
  
  SWT2: OU had memory issues and is implementing cgroups. A DNS issue at UTA affected CPB.
  
  External issues, particularly a DNS issue at CERN, lowered production.
  
  Had a nice conversation with Lincoln, Johannes, and Ofer last Friday about preserving the error information and putting it into the log when a Rucio transfer fails. Johannes gave some good ideas for debugging Rucio transfer issues. We are checking to see if the missing error information/logs turn up as dark data.
  - 13:40
    
    AGLT2 5m
    
    Minutes
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
    
    Tickets:
    
       open 147805: Continued issues with dcache (see previous report for details)
       Tried installing 5.2.24 when it became available.
       This caused a high rate of transfer errors where the ftp connection is dropped,
       seemingly after the transfer has been negotiated but before the first byte of payload is sent.
       We downgraded back to the patch version of 5.2.22, where we still see that issue but at a lower rate.
    
       closed 147784: catching up on updates for squid servers
    
       closed 147769: files not accessible. One dcache server had most of its pools offline.
    
    Services:
    
       updating AFS servers to CentOS7, ongoing.
    
       BOINC: incremental improvements
    
       Condor: some misbehaving T3 jobs used more memory than should have been allowed (~10G instead of 2G)
       and caused ~50 worker nodes to become unresponsive.
       The condor configuration on the submit nodes was updated to protect against this problem.
    
    Working on updating/securing ELK at AGLT2. Complete except that base OS is SL6 and ELK 7.8 needs CentOS 7+
    
    Hardware:
    
    Ordered 26x C6420s (20 for UM, 6 for MSU) and 7x R740XD2 (5 for UM, 2 for MSU)
  - 13:45
    MWT2 5m
    
    Minutes
    
    Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
    
    UC
    
    Analytics cluster upgraded to Elasticsearch 7.8
    
    Benchmarked Dell nodes for new purchase
    
    IU
    
    Working on IPv6 configuration
    
    UIUC
    
    ICC Quarterly PM July 15. All worker nodes updated to the latest kernel and GPFS client
    
    IPv6 issues on a number of workers were causing problems connecting to both CERN and the UC storage. Fixed after reboots
  - 13:50
    
    NET2 5m
    
    Minutes
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    Smooth operations, no tickets, site is full... other than low level DDM timeouts, mainly to NL cloud.
    
    NESE_DATADISK now used for job staging as well as general I/O.
    
    Planning a trip to MGHPCC to replace fans and broken disks, and to re-cable the NET2 6 PB in NESE for expanding NESE_DATADISK.
    
    NESE Tape Tier solution will be IBM TS4500 (same as BNL). Configs and quotes are close to finalized. Space, power, and cooling are being prepared at MGHPCC. A pod will be dedicated to tape libraries; it is large enough to hold four 18' libraries. The neighboring pod will hold the front-end system and ATLAS DDM nodes. Protocols will be POSIX (GPFS) with the file system also covered as S3.
    
    We've been in touch with Lincoln, re: SLATE. No particular security issues are a problem for BU Research Computing. We are following Lincoln's instructions and will then likely have a session with Lincoln to get things going.
    
    We've ordered 16 new AMD worker nodes from Dell.
    
    Additional infrastructure requests set up in Shawn & Fred's document.
  - 13:55
    SWT2 5m
    
    Minutes
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    UTA:
    
    No significant issues over the last two weeks
    
    Previous issue with campus/Internet DNS when changing the registrar for the atlas-swt2.org domain
    
    Working in earnest on deploying SLATE machine, with Lincoln's help
    
    OU:
    
    - Overall no problems, running well
    
    - Today OSCER maintenance
    
    - OSG downtime apparently not propagated to WLCG, investigating
    
    - SAM3 CE tests submitted without maxWallTime, causing them to be submitted with UNLIMITED WallTime to SLURM, causing timeouts because of scheduled cluster maintenance window. Opened GGUS ticket, will be fixed by SAM developers.
    
    - Benchmarked Gold 6230 with a lot of Fred's help: 946.39 total, for a benchmark of 11.83 HS06/HT-Core
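
The OU benchmark figures above can be cross-checked with simple arithmetic. This is a minimal sketch, assuming the node is a dual-socket Gold 6230 system with Hyper-Threading enabled (2 x 20 cores x 2 threads = 80 HT cores); the minutes give only the total and the per-core value, so the socket count is an assumption here.

```python
# Values reported in the OU notes above.
total_hs06 = 946.39
reported_hs06_per_ht_core = 11.83

# Assumption: dual-socket Gold 6230 (20 cores per socket) with Hyper-Threading.
sockets = 2
cores_per_socket = 20
threads_per_core = 2
assumed_ht_cores = sockets * cores_per_socket * threads_per_core

# Thread count implied by the two reported numbers.
implied_ht_cores = round(total_hs06 / reported_hs06_per_ht_core)

# Per-core score recomputed under the assumed configuration.
recomputed_hs06_per_ht_core = round(total_hs06 / assumed_ht_cores, 2)

print(f"Implied HT cores:      {implied_ht_cores}")
print(f"Recomputed HS06/core:  {recomputed_hs06_per_ht_core}")
```

Under that assumption the recomputed per-core value matches the reported one, which suggests the reported figure was obtained by dividing the total by the number of hardware threads.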
- 14:00 → 14:05
  WBS 2.3.3 HPC Operations 5m
  
  Minutes
  
  Speakers: Doug Benjamin (Duke University (US)), Lincoln Bryant (University of Chicago (US))
  
  HPC Integration Updates @ TACC
  
  HPC Integration Updates @ TACC.pdf
  500,000 SU allocation on Frontera
  
  (1 SU = 56 physical Xeon Platinum cores * 1 hr)
  
  Jobs execute without CVMFS, running athena:21.0.15_31.8.1-noAtlasSetup container
  
  ALRB setup and maintained via Cron on the login nodes
  
  Have been working to understand best job "shape" for optimal throughput
  
  Testing number of parallel nodes (1, 5, 10, 20, 50, 100) and varying number of events (250, 500, 1000)
  
  Overall: TACC is working, slowly ramping up utilization & consulting with TACC support as we go.
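
To put the allocation and the test matrix above in concrete terms, here is a minimal sketch in Python. It uses the SU definition quoted in the notes (1 SU = 56 physical cores for 1 hour) and assumes, for illustration only, that every combination of node count and event count is tested; the minutes do not say whether the full grid was run. All names are illustrative.

```python
from itertools import product

# Allocation and SU definition quoted in the notes above.
allocation_su = 500_000
cores_per_su_hour = 56  # 1 SU = 56 physical cores * 1 hr

# Total allocation expressed in physical core-hours.
allocation_core_hours = allocation_su * cores_per_su_hour

# Job "shape" parameters listed in the notes.
node_counts = [1, 5, 10, 20, 50, 100]
event_counts = [250, 500, 1000]

# Assumption: the full Cartesian product of the two parameters is tested.
test_grid = list(product(node_counts, event_counts))

print(f"Allocation: {allocation_core_hours:,} core-hours")
print(f"Test configurations: {len(test_grid)}")
for nodes, events in test_grid[:3]:
    print(f"  nodes={nodes}, events={events}")
```

Enumerating the grid up front makes it easy to record throughput per configuration later without losing track of which shapes have been covered.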
- 14:05 → 14:20
  WBS 2.3.4 Analysis Facilities
  
  Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:05
    
    Analysis Facilities - BNL 5m
    
    Speaker: William Strecker-Kellogg (Brookhaven National Lab)
  - 14:10
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:15
    
    ATLAS ML Platform & User Support 5m
    
    Speaker: Ilija Vukotic (University of Chicago (US))
- 14:20 → 14:40
  WBS 2.3.5 Continuous Operations
  
  Minutes
  
  Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))
  Fairly smooth operations (minor dCache issue at T1, various at T2, CERN DNS and FTS problems)
  
  Everyone should be using new SSB dashboard
  
  Thanks to everyone who worked on PQ unification
  
  Discussed Rucio error logging issue with Fred, Johannes, Lincoln; gathering info for follow-up
  
  Updates to downtime declaration procedure
  
  SLATE tutorial at PEARC next Friday: https://pearc20.sched.com/event/cnXu
  
  Working on quarterly report
  - 14:20
    
    US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
    
    Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
    
    US-cloud-summary-7_1_20.pdf
    
    US-cloud-summary-7_15_20.pdf
    
    US-cloud-summary-7_22_20.pdf
    
    US-cloud-summary-7_8_20.pdf
  - 14:25
    
    Analytics Infrastructure & User Support 5m
    
    Minutes
    
    Speaker: Ilija Vukotic (University of Chicago (US))
    
    ES upgraded to 7.8.0
    
    Working on cross monitoring of clusters at UC and UM. Still not finished.
    
    Added Tier0 Oracle data indexing. Will take a week to set up everything.
    
    Issue with Panda Oracle DB. Two days' worth of data missing from it.
    
    Regularly helping ML platform users.
  - 14:30
    
    Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
    
    Minutes
    
    Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    
    XCache at v5.0.0-1.
    
    Issues with g-stream. Hopefully will be fixed in a month or so.
    
    Rewrote my cinfo reporter to support v3 of cinfo files.
    
    Will be trying new OSG images produced yesterday. They should address several important issues.
    
    VP running, but queues still in brokeroff.
- 14:40 → 14:45
  
  AOB 5m
