US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2020-04-01T13:00:00-04:00
End: 2020-04-01T14:45:00-04:00
Location: No location set

Wednesday 1 Apr 2020, 13:00 → 14:45 US/Eastern

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
- 13:10 → 13:20
  OSG-LHC 10m
  
  Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
  COVID-19 Research
  
  COVID-19 payloads will be submitted through the OSG VO
  
  Sites can give priority to COVID-19 OSG pilots through HTCondor-CE configuration. Specific documentation will be published and announced
  
  OSG 3.5.13 (tomorrow!)
  
  CA certificate update
  
  Maybe XRootD 4.11.3-2 (fixes an issue at OU) depending on site testing results
  
  HTCondor-CE 4.2.1: use SSL auth instead of GSI for advertising to the central collector
  
  GridFTP: includes a patch that fixes missing transfer logs
  
  OSG 3.5.14/3.4.48 (next Tuesday)
  
  HTCondor security release (HTCondor-CEs unaffected)
  
  Other
  
  Does anyone use the OSG rolling release repositories?
  
  When will the first ATLAS site upgrade to HTCondor-CE 4 (available in OSG 3.5 release) and HTCondor 8.9 (available in OSG 3.5 upcoming)?
  
  The GridFTP replacement, XRootD standalone, is ready to be piloted. We're very interested in ATLAS needs and feedback
- 13:20 → 13:35
  Topical Reports
  
  Convener: Robert William Gardner Jr (University of Chicago (US))
  - 13:20
    
    Demo of Network Dashboards 15m
    
    Demo of the draft Kibana dashboards looking at our perfSONAR data.
    
    Speaker: Shawn Mc Kee (University of Michigan (US))
    
    USATLAS: OSG/WLCG perfSONAR Network Monitoring and Analytics
- 13:35 → 13:40
  WBS 2.3.1 Tier1 Center 5m
  
  Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
  Normal operation in general at the T1, but in min-safe mode (access to SDCC will be permitted only for hardware failures or to fix unexpected outages until further notice).
  
  ARC-CE/gridftp interface enabled for ATLAS GPU jobs on the IC cluster. Will switch to gridftp submission mode on the CERN harvester side soon.
  
  cvmfs and HTCondor upgrade ongoing on the farm nodes, in rolling fashion.
- 13:40 → 14:00
  WBS 2.3.2 Tier2 Centers
  
  Updates on US Tier-2 centers
  
  Convener: Fred Luehring (Indiana University (US))
  
  The Tier 2 sites are running well, and I have been pinging sites when I see job failures, open tickets, etc. The number of open team tickets was dangerously close to zero, but a flurry of activity this morning opened some more tickets.
  
  There will be a pre-review review of the Tier 2 sites in preparation for the 5-year renewal of the Tier 2 program.
  - 13:40
    AGLT2 5m
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
    
    Service:
    
    Updated OSG to 3.5 on 2 out of 3 gatekeepers. The ATLAS production gatekeeper is still on 3.4, waiting for the condor update on our cluster.
    
    Testing and updating condor from 8.6.13 to 8.8.7. We encountered some problems due to the new features of 8.8.7-1 and fixed them with some workarounds. We also plan to rebuild all the worker nodes with a separate partition for condor jobs instead of sharing it with /tmp. The condor head node is also an SL6 node; we plan to update it to SL7 with condor 8.8.7-1.
    
    dCache was updated to 5.2.16; it took longer than we planned.
    
    Had a problem with the dCache database (zpool problem) that took half a day to recover. HTCondor ramped down because of this, and there were related GGUS tickets 146141 (solved) and 146144 (same as 146141, requested to close) on 22 March 2020.
    
    Tickets:
    
    146371: file transfer errors with gfal-copy, but transfers work with xrdcp; still investigating. We restarted the pool, and it worked for a while and then stopped working again.
    
    Hardware:
    
    Finished retiring old storage for this last purchase cycle (no more until the next cycle) and are updating the storage by year of purchase.
    
    Access during lockdown
    
    Working remotely, but access to T2 equipment is allowed for Wenjing and Shawn at UM and for Philippe and Dan Hayden at MSU.
  - 13:45
    MWT2 5m
    
    Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
    
    COVID-19 Site Status
    
    UC: As of today, UChicago is only permitting Essential Staff building access on a per-building basis limited to one day a week
    
    IU: No access to the IUPUI server room. Compute maintenance is on best-effort
    
    UIUC: NCSA is remote. Compute maintenance is on best-effort
    
    UC
    
    Investigating low level dCache transfer errors
    
    Added additional xrootd dCache doors
    
    Downgraded kernels on the R740xd2 storage nodes back to stock. We were running mainline to get better network performance; the bonds looked better, but this caused 1000s of transfer errors/day.
    
    UIUC
    
    ICCP will retire 3824 cores of our older worker nodes in the coming months (rows 67, 68, and 69 on the v51 tab on the USATLAS capacity spreadsheet)
  - 13:50
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    Access to MGHPCC is still allowed with scheduling and preparation. Not a major limitation for us in practice.
    
    Added two new NESE gateway nodes for gridftp transfers. The NESE nodes are working, and we are working with ADC on moving more into production. A new AGIS site (BU_NESE) and a new NESE_DATADISK have been created; it will be a "nucleus" site and is being tested by ADC. New storage has arrived except for a couple of management switches from DELL, which have been delayed until this month.
    
    Ordering replacements for broken fans in various C6000 chassis.
    
    Rolling kernel updates are in progress on the worker nodes.
    
    SLATE node installed (atlas-slate01.bu.edu) and first pass at installation attempted. We'll be in touch with SLATE team soon.
    
    Proceeding to prepare a large volume tape tier for NESE & NET2. Aiming for initial ~30PB storage with ~0.5PB front end. Meeting with vendors (IBM, SpectraLogic and Quantum). Want to compare notes with Xin and BNL.
    
    Smooth operations otherwise in the past two weeks except that the site isn't really getting saturated.
  - 13:55
    
    SWT2 5m
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    SWT2_CPB:
    
    Investigating an issue with our MD3XXXi based storage systems that shows episodic failures for staging files to worker nodes. Looking at memory pressure settings in the kernel and driver firmware updates.
    
    OU:
    
    Not much, things are running fine.
    
    Some job failures because of incorrect condor JDL files coming in from the pre-production harvester instance. Being worked on.
- 14:00 → 14:05
  
  WBS 2.3.3 HPC Operations 5m
  
  Speaker: Doug Benjamin (Duke University (US))
  
  NERSC keeps running. Got 3000 nodes earlier this week.
  
  New tasks were assigned to NERSC; put them on pause until older tasks finish.
  
  Lincoln and DB met on Friday to discuss how to go forward at TACC and transfer knowledge (and create confusion) from DB to Lincoln.
- 14:05 → 14:20
  WBS 2.3.4 Analysis Facilities
  
  Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:05
    
    Analysis Facilities - BNL 5m
    
    Speaker: William Strecker-Kellogg (Brookhaven National Lab)
    
    Condor update to (security patched) version 8.8.8 rolled across the shared pool (T3 affected), only ~20% draining at once.
    
    Announced to users that BNL's "min-safe" operations (only ~5% of staff on site) may affect response times, but we strive to keep the facility 100% operational.
  - 14:10
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    
    ARC-CE with GridFTP job submission for ANALY_SLAC_GPU is working.
  - 14:15
    
    ATLAS ML Platform & User Support 5m
    
    Speaker: Ilija Vukotic (University of Chicago (US))
    
    Running smoothly.
    
    Now opportunistically running COVID jobs through the Folding at Home platform.
    
    Postponed update of GPU drivers and CUDA versions.
- 14:20 → 14:40
  WBS 2.3.5 Continuous Operations
  
  Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))
  
  In the midst of evaluating monitoring and communication tools and protocols. Have had fruitful discussions, as a group, thus far in an attempt to identify problem areas and potential solutions.
  - 14:20
    US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
    
    Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
    
    US-cloud-summary-3_11_20.pdf
    
    US-cloud-summary-3_18_20.pdf
    
    US-cloud-summary-3_25_20.pdf
    
    US-cloud-summary-4_1_20.pdf
    
    https://docs.google.com/document/d/15RunDQwHpEaEGIKbJTPxhWsbCxF-CAo8jwBmnkg29EM/edit
  - 14:25
    
    Analytics Infrastructure & User Support 5m
    
    Speaker: Ilija Vukotic (University of Chicago (US))
    
    Working smoothly.
    
    Missing traces and events (both at CERN and UC). Investigating with Thomas B.
    
    dCache logs are being ingested; for now billing data only (Filebeat configuration; doors and pools still to be instrumented).
    
    Hope to start getting ESnet data next week.
  - 14:30
    
    Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
    
    Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    
    Xcache servers working smoothly
    
    At MWT2, failures from TRIUMF (working with Simon, Andy, and Matevz on understanding the issue) and LRZ (downtime).
    
    At AGLT2 moved to ANALY_AGLT2_VP queue. Works well. Will try to ramp up in a day or two.
    
    At Prague: networking issues (Puppet/k8s interaction); storage was 6 RAID arrays, not split into JBODs (78); new NIC (20 Gbps).
    
    Will work on Munich inclusion in VP.
- 14:40 → 14:45
  
  AOB 5m
