US ATLAS Computing Facility

US/Eastern
    • 13:00 – 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:10 – 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG releases 3.5.16 / 3.4.50:

      • XRootD 4.11.3 in OSG 3.4
      • Blahp and Gratia probe fixes
    • 13:20 – 13:50
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
    • 13:35 – 13:40
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
      • Finished the HTCondor upgrade to 8.8.8 on the CEs and the farm. The upgrade triggered a negotiator bug that starves mcore jobs; a workaround is in place and production is back to normal levels.
      • COVID-19 jobs have ramped up at BNL: 9k+ running jobs now, surpassing other sites in OSG over the past couple of days.
    • 13:40 – 14:00
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))

      Sites have been running fairly well this week, but the HTCondor security upgrade caused some interference with production.

      We are giving some resources to the COVID-19 effort, which is affecting the number of slots available to ATLAS production (as expected).

      • 13:40
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        1. Condor update:
        The main goal is to update everything to 8.8.8 to address the security issue.

        1.1) We started a big project of rebuilding all 400 worker nodes of the Condor cluster.
        One motive for the rebuild was to separate the partition used by Condor jobs from the tmp partition;
        it also updated HTCondor from 8.6.13 to 8.8.8/8.8.7-1.
        We started with the UM site and have finished rebuilding all of its worker nodes (179):
        2/3 of the UM nodes are running 8.8.7-1 and 1/3 are running 8.8.8, depending on the rebuild day.
        The worker nodes at MSU are now being rebuilt in batches: about 1/3 yesterday, another ~1/3 today,
        and the rest on Thursday and Friday.

        1.2) Updated (switched) the Condor head node from SL6 / 8.6.13 to SL7 / 8.8.8.

        1.3) During the update of the main gatekeeper we encountered a problem:
        idle ucore jobs did not get scheduled to unclaimed cores.
        This was solved by updating the head node to 8.8.8
        and adding a workaround to the negotiator
        (to address a possible negotiator bug in 8.8.8).

        2. Job failures caused by the OOM killer.

        This is very likely caused by:

        a) high-memory pile-up jobs (a single job used 56 GB of memory at peak)
        running on our score queue (2 GB/core);

        b) the BOINC jobs our site also runs, which use extra memory on the worker nodes.

        To address this issue, we stopped the BOINC jobs.
        Now that the problem caused by the Condor update is solved,
        it is a good time to monitor whether the same error still occurs.
        Fred has been in contact with ADC to ask whether the pile-up jobs
        can be put in a high-memory queue.

        c) BOINC jobs will remain suspended until we understand more about the situation.
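        To check whether the OOM killer is still firing after the BOINC suspension, the kernel log on a worker node can be scanned for its "Killed process" messages. A minimal, self-contained sketch follows; the log lines are illustrative samples (the exact message format varies between kernel versions), and the process names and sizes are made up for the example:

```python
import re

# Kernel OOM-killer lines as they appear in `dmesg` output.
# Sample data for illustration; on a worker node you would read
# the real dmesg or journal output instead.
sample_dmesg = """
[1234.5] Out of memory: Killed process 4321 (athena.py) total-vm:58720256kB, anon-rss:56623104kB
[2345.6] Out of memory: Killed process 5678 (boinc) total-vm:4194304kB, anon-rss:2097152kB
"""

OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r".*?anon-rss:(?P<rss_kb>\d+)kB"
)

def oom_victims(dmesg_text):
    """Return (pid, process name, resident memory in GB) for each OOM kill."""
    return [
        (int(m["pid"]), m["name"], int(m["rss_kb"]) / 1024**2)
        for m in OOM_RE.finditer(dmesg_text)
    ]

for pid, name, rss_gb in oom_victims(sample_dmesg):
    print(f"pid={pid} name={name} rss={rss_gb:.1f} GB")
```

        Counting how often a payload name versus "boinc" shows up as the victim would indicate whether suspending BOINC actually removed the memory pressure.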


        3. AGLT2 started running COVID-19 jobs last Wednesday.
        We gave them a quota of up to 2000 cores, which can be expanded to 5000.
        For now we do not see enough jobs queued to our site;
        the average number of COVID-19 jobs we process is around 800.

        4. Ticket 146371

        Weird problem: a small set of files is accessible via xrootd but not via gsiftp.
        Restarting dCache on the pool node fixes it for a short time.
        Shawn opened a ticket with the dCache developers.
        No resolution yet.

        5. COVID-19

        No change to the access plan at UM or MSU.

         

      • 13:45
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
        • ICC planned maintenance (PM) today to apply GPFS client and network updates
        • In the process of adding IPv6 to the UC workers: all workers are configured and PTR records were added Monday; AAAA records still need to be added
        • Upgraded condor to 8.8.8-1.osg35
        • Updated all workers to use the OSG rolling release
        • Added COVID-19 job routes to MWT2 for running OSG COVID-19 jobs
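        On the AAAA/PTR bookkeeping above: the reverse (PTR) owner name for an address can be derived with Python's stdlib `ipaddress` module, which is handy when generating or cross-checking the two record sets. The address below is from the IPv6 documentation prefix, a placeholder rather than a real MWT2 worker:

```python
import ipaddress

def ptr_name(addr: str) -> str:
    """Return the reverse-DNS (PTR) owner name for an IP address.
    For IPv6 this is the nibble-reversed form under ip6.arpa."""
    return ipaddress.ip_address(addr).reverse_pointer

# 2001:db8::/32 is the IPv6 documentation prefix (RFC 3849),
# used here as a stand-in for a worker-node address.
print(ptr_name("2001:db8::1"))
```

        Generating the PTR names from the same host list used for the AAAA records keeps the forward and reverse zones from drifting apart.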
      • 13:50
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        Smooth operations except for high-temperature alarms due to broken fans. Replacement fans have been ordered.

        Site was not getting filled by PanDA for a few weeks, but it's better now. 

        Two more NESE gateways were added in anticipation of ramping up. Working as NESE_DATADISK in AGIS & Rucio.

        The 6 PB NESE upgrade arrived, was installed and tested, but the switches from Dell have been delayed twice.

        Converging on a NET2/NESE tape tier. Getting helpful feedback from BNL and others in HEP.

        Fred noticed that a few of our oldest nodes were showing a ~50% failure rate, strangely from stage-out timeouts. The problem quickly disappeared, but we haven't yet figured out the cause.

        A user complained about 5 missing files at NET2_DATADISK. They were indeed missing, marked as gone by DDM. We don't think it's related to a local issue.

         

      • 13:55
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA_SWT2:

        •   Running well
        •   Offered cores to the OSG VO for COVID-19 jobs; awaiting a response

        SWT2_CPB:

        • Running well
        • Issues from GGUS ticket 146387 are no longer occurring, but we are waiting to see whether they come back with a different job mix
        • An issue was found with the latest Gratia Slurm probe; trying to get corrected data into GRACC to forward to APEL
        • A backup generator test for the facility is scheduled for tomorrow night

        OU:

        • Nothing to report; all running well.

         

    • 14:00 – 14:05
      WBS 2.3.3 HPC Operations 5m
      Speaker: Doug Benjamin (Duke University (US))

      NERSC switched over to using the ALCC allocation this morning.

       

    • 14:05 – 14:20
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:05
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)

        The shared pool is dedicating ~30% of its resources to COVID research, but the T3 is unlikely to be affected given the scale.

        BNL email: DOE-wide blocking of non-MFA IMAP! Users will have to tunnel to BNL or use webmail (with MFA set up): https://webmail.rhic.bnl.gov/

      • 14:10
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:15
        ATLAS ML Platform & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        Working fine. Running Folding@home opportunistically on roughly 7-8 GPUs. Had one issue with a k8s cluster upgrade.

        Drivers and k8s will be upgraded next Monday. k8s moves from 1.15 to 1.18, a big change, so we will have to review everything we are running on it.

    • 14:20 – 14:40
      WBS 2.3.5 Continuous Operations
      Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))
      • 14:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:25
        Analytics Infrastructure & User Support 5m
        Speaker: Ilija Vukotic (University of Chicago (US))

        Running fine.

        ESnet data is now coming in at a low rate; we will go full throttle once we confirm everything is set up the way we want it.

        perfSONAR data has been replayed from tape back to January 2019.

        Developed a filebeat-based ingest for dCache logs. Let me know if you need it.
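        The report doesn't include the ingest configuration itself; as a rough sketch of what a Filebeat input for dCache logs typically looks like, something along these lines would be the starting point. The log path and Elasticsearch endpoint are placeholders (the real locations depend on the dCache install and the analytics cluster), not details from the report:

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/dcache/*.log        # placeholder; actual path depends on the install
    fields:
      service: dcache                # tag events so they are easy to filter downstream
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'   # join stack traces onto the preceding timestamped line
    multiline.negate: true
    multiline.match: after

output.elasticsearch:
  hosts: ["https://es.example.org:9200"]      # placeholder endpoint
```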

      • 14:30
        Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
        Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Andrew Hanushevsky (Unknown), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:40 – 14:45
      AOB 5m