US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2019-10-16T13:00:00-04:00
End: 2019-10-16T15:00:00-04:00
Location: No location set

Wednesday 16 Oct 2019, 13:00 → 15:00 US/Eastern

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
- 13:10 → 13:20
  OSG-LHC 10m
  
  Minutes
  
  Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
  OSG 3.5.3 and 3.4.37 (tomorrow)
  
  HTCondor 4.0.1 (3.5 only, upgrade notes)
  
  HTCondor 8.8.5 in 3.5, including new default security configuration (upgrade notes)
  
  HTCondor 8.8.5 in 3.4, without the new default security configuration above (upgrade notes)
  
  Fixes to Slurm accounting probes
  
  Other
  
  Any feedback on our community testing emails/process?
  
  No feedback but Brian expressed interest in increased ATLAS engagement
  
  ATLAS XCache has changed rather substantially and needs testing. New version available in the 'fresh' tag
  
  Wenjing, Brian, and Ilija will work together on deploying the new version
  
  Fred was wondering how BOINC hours could be accounted
  
  Since BOINC work doesn't come through the CE, the full OSG accounting workflow can't be used and development work will be required to get the data out of BOINC
  
  BrianL will discuss with the GRACC team to see if they can/are willing to accept data generated by ATLAS BOINC so that the data can be sent to APEL with the rest of the monthly accounting data
- 13:20 → 14:00
  Topical Report
  - 13:20
    
    OSiRIS ATLAS Event Service 15m
    
    Speaker: Benjeman Jay Meekhof (University of Michigan (US))
- 13:40 → 14:25
  US Cloud Status
  - 13:40
    
    US Cloud Operations Summary 5m
    
    Speaker: Mark Sosebee (University of Texas at Arlington (US))
    
    US-cloud-summary-10_16_19.pdf
    
    US-cloud-summary-10_2_19.pdf.pdf
    
    US-cloud-summary-10_9_19.pdf
    
    US-cloud-summary-9_25_19.pdf
  - 13:45
    
    BNL 5m
    
    Speaker: Xin Zhao (Brookhaven National Laboratory (US))
  - 13:50
    
    AGLT2 5m
    
    Minutes
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
    
    service:
    
    3 ggus tickets:
    
    1) analysis jobs fail at "not enough resource available", this is cause by a ulimit set on our Condor system, it is fixed
    
    2) UCORE jobs fail at authentication, due to a problematic work node, the node is rebuilt
    
    3) 2 sites have transfer errors to AGLT2, (timeout), unresolved. ticket priority is changed to low priority
    
    Hardware:
    
    Received 3 R740x2d storage nodes(dCache pool nodes),1 to MSU, 2 to UM. waiting for provision.
    
    Update on BOINC operation at AGLT2:
    
    Reminder: we had difficulties with the kernel not effectively applying a lower priority to the BOINC jobs. They ended up receiving roughly half of the CPU cycles which had never been the goal. At the last meeting Wenjing had presented and reported on switching to using cgroups to try and tame that behavior. But we did not have results or graphs to show at the time.
    
    Here is a link to the current CPU efficiency which is already back to a much more reasonable and acceptable range while we are continuing to work on tuning this BOINC-backfilling model.
    
    https://monit-grafana.cern.ch/d/000000696/job-accounting-historical-data?orgId=17&from=now-7d&to=now&var-bin=1h&var-groupby=dst_experiment_site&var-country=USA&var-federation=All&var-resources=All&var-tier=2&var-cloud=US&var-site=All&var-computingsite=All&var-nucleus=All&var-cores=All&var-eventservice=All&var-groups=All&var-inputdatatypes=All&var-inputprojects=All&var-outputproject=All&var-gshare=All&var-resourceserporting=All&var-processingtype=All&var-jobtype=All&var-jobstatus=All&var-error_category=All&var-measurement_suffix=1h&var-measurement_suffix_CQ=1h&var-retention_policy=long&var-division_factor=1&panelId=34&fullscreen
  - 13:55
    MWT2 5m
    
    Minutes
    
    Speakers: Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))
    
    UC
    
    Various storage work
    
    Combination of broken space reporting and upstream change to a new reaper caused the site to fill up and be taken offline for a few days
    
    Added IPv6 addresses to the MWT2 dCache nodes
    
    Ongoing IPv6 debugging issues. Seems the UChicago IPv6 route isn't being advertised correctly. Ryan is working with Edoardo and ESnet to fix
    
    Final quotes sent to purchasing
    
    IU
    
    Compute quotes finalized
    
    UIUC
    
    Quarterly PM today from 8am to 8pm CST
  - 14:00
    
    NET2 5m
    
    Minutes
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    Minor operations issues:
    
    1. Low rate of CA errors only outgoing transfers to only certain external sites. Might or might not be a problem on our end, but we have to investigate.
    
    2. We're occasionally still seeing too many squid failovers.
    
    News:
    
    o Storage purchase out including a slate node
    
    o More worker and storage purchases should go out in the next few days.
    
    o We need a bit of manual operations from CERN DDM to set up the NESE gridftp endpoint for testing.
  - 14:05
    
    SWT2 5m
    
    Minutes
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    OU:
    
    - Nothing to report, all running well.
    
    UTA:
    
    - Storage server issues have shown up in both SWT2_CPB and UTA_SWT2, which required moving data to other servers.
    
    - Seeing high loads on some data servers, which is causing problems with Event Index jobs, we are testing if a firmware update fixes the problem.
    
    - SLATE node is being worked on.
  - 14:10
    
    HPC Operations 5m
    
    Speaker: Doug Benjamin (Duke University (US))
  - 14:15
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:20
    
    Analysis Facilities - BNL 5m
    
    Speaker: William Strecker-Kellogg (Brookhaven National Lab)
- 14:25 → 14:30
  
  AOB 5m

Choose timezone

US ATLAS Computing Facility