US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2020-02-19T13:00:00-05:00
End: 2020-02-19T14:45:00-05:00

Wednesday 19 Feb 2020, 13:00 → 14:45 US/Eastern

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
  
  top of meeting slides
- 13:10 → 13:20
  OSG-LHC 10m
  
  Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
  Releases this week:
  
  XRootD 4.11.2
  
  UberFTP 2.8-3 (repackaging after OSG contributed patches to the new Grid Community Forum upstream: https://github.com/gridcf/uberftp)
  
  HCC VO update (important if your site supports HCC!)
  
  Reminders
  
  InCommon CA DN formats changed a few months ago (state abbreviations -> full state names), so a newly issued host cert may have a different DN than the cert it replaces
  
  OSG 3.4 moves to critical bug and security fixes only at the end of this month, and support ends entirely at the end of November 2020: https://opensciencegrid.org/technology/policy/release-series/
  
  Documentation and packaging for XRootD standalone (GridFTP replacement) is ready! https://opensciencegrid.org/docs/data/xrootd/install-standalone/
  
  OSG All Hands registration: https://opensciencegrid.org/all-hands/2020/
  
  Other
  
  There was an issue with the GRACC -> WLCG accounting process for January that was resolved last week (the initial APEL report was broken but was promptly fixed). Xin mentioned that he needed to manually update numbers in CRIC for BNL.
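
For the InCommon DN reminder above, a site may want to confirm that an old and a new host cert DN differ only in the state field. The following is a minimal sketch, not an OSG tool: the DN strings, the `STATE_NAMES` table, and both function names are hypothetical, and the parser assumes slash-separated DNs whose values contain no `/`.

```python
# Hypothetical, partial mapping of state abbreviations to full names.
STATE_NAMES = {"MI": "Michigan", "IL": "Illinois", "NY": "New York"}


def parse_dn(dn):
    """Split a slash-separated DN into a list of (key, value) pairs."""
    return [tuple(part.split("=", 1)) for part in dn.strip("/").split("/")]


def differs_only_by_state_name(old_dn, new_dn):
    """True if the DNs differ only by ST abbreviation -> full state name."""
    old, new = parse_dn(old_dn), parse_dn(new_dn)
    if len(old) != len(new):
        return False
    state_changed = False
    for (old_key, old_val), (new_key, new_val) in zip(old, new):
        if old_key != new_key:
            return False
        if old_val == new_val:
            continue
        if old_key == "ST" and STATE_NAMES.get(old_val) == new_val:
            state_changed = True
        else:
            return False
    return state_changed


# Made-up DNs for illustration only.
old_dn = "/DC=org/DC=incommon/C=US/ST=MI/O=Example University/CN=gk01.example.edu"
new_dn = "/DC=org/DC=incommon/C=US/ST=Michigan/O=Example University/CN=gk01.example.edu"
print(differs_only_by_state_name(old_dn, new_dn))
```

If the check returns true, any place where the old DN is registered would presumably need the new DN as well.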
- 13:20 → 13:35
  Topical Reports
  
  Convener: Robert William Gardner Jr (University of Chicago (US))
  - 13:20
    
    TBD 15m
- 13:35 → 13:40
  WBS 2.3.1 Tier1 Center 5m
  
  Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
  normal operations in general
  
  Two WNs were built incompletely and became black holes due to missing CVMFS files. They were taken down for a rebuild.
  
  January job accounting numbers were initially off by ~50%, later corrected on APEL. Manually fixed the numbers on CRIC.
  
  data17 reprocessing started today. BNL tape staging running fine so far.
- 13:40 → 14:00
  WBS 2.3.2 Tier2 Centers
  
  Updates on US Tier-2 centers
  
  Convener: Fred Luehring (Indiana University (US))
  - 13:40
    
    AGLT2 5m
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
    
    software update:
    
    Updated the OSG software and htcondor-ce to the most recent release on all 3 gatekeepers
    
    Frontier Squid was also updated to 4.10-1.1.osg34.el6
    
    Plan to upgrade all our SLC6 nodes to SLC7, including the dCache, HTCondor, and AFS services
    
    Job Errors:
    
    Many jobs are failing with this error:
    
    Non-zero return code from RAWtoESD (65); Logfile error in log.RAWtoESD: "AthMpEvtLoopMgr ERROR Failure in waiting or sub-process finished abnormally"
    
    Some of the worker nodes fail 100% of their jobs. We identified and rebuilt around 15 affected worker nodes; after the rebuild they no longer fail many jobs (failure rate lower than 10%).
    
    Note: This error also appears in jobs at 8 other sites; AGLT2 fails 1/5 of them. There is no ticket, and it is not clear whether the error comes from the job itself or from the worker nodes.
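
The AGLT2 notes above describe finding worker nodes that fail essentially all of their jobs. A minimal sketch of that kind of check follows, assuming job records are available as (node, succeeded) pairs. The record format, node names, thresholds, and function names are all hypothetical and are not taken from the AGLT2 tooling.

```python
from collections import Counter


def failure_rate_by_node(jobs):
    """Return {node: fraction of failed jobs} from (node, succeeded) pairs."""
    total = Counter()
    failed = Counter()
    for node, succeeded in jobs:
        total[node] += 1
        if not succeeded:
            failed[node] += 1
    return {node: failed[node] / total[node] for node in total}


def suspect_nodes(jobs, threshold=0.9, min_jobs=5):
    """Nodes with at least min_jobs jobs and a failure rate >= threshold."""
    total = Counter(node for node, _ in jobs)
    rates = failure_rate_by_node(jobs)
    return sorted(
        node for node, rate in rates.items()
        if total[node] >= min_jobs and rate >= threshold
    )


# Made-up records: wn-001 fails everything, wn-002 fails 1 of 10,
# wn-003 has too few jobs to judge.
jobs = (
    [("wn-001", False)] * 6
    + [("wn-002", True)] * 9
    + [("wn-002", False)]
    + [("wn-003", False)] * 2
)
print(suspect_nodes(jobs))  # ['wn-001']
```

The `min_jobs` cutoff is there so that a node with only a couple of failed jobs is not flagged on thin evidence.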
  - 13:45
    MWT2 5m
    
    Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
    
    UC
    
    dCache upgrade to 5.2 in progress as of this morning
    
    Site drained via switcher3 since Monday - is this new behavior?
    
    Updated capacity spreadsheet and topology for new dCache purchases
    
    UIUC
    
    24 new workers (1960 cores) received Monday, in the process of being racked
  - 13:50
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    We're having some trouble keeping the site consistently full due to: GPFS sometimes getting slightly clogged -> stage-in timeouts -> blacklisting by HC. I'm not sure if this is overlapping with global production issues. We're still investigating this.
    
    SLATE node transfer happening at MGHPCC today.
    
    BU networking has agreed to set up IPv6 (NET2 is the first requestor at BU). Started a "project". I'll know more about timescales by Oklahoma. The main issue is updating the DNS infrastructure.
    
    NESE storage racks have UPS power now. The new storage nodes are racked, powered, being tested.
  - 13:55
    SWT2 5m
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    UTA:
    
    SWT2_CPB
    
    ADC forcibly changed the panda queues to use rucio mover rather than LSM
    
    This caused many problems, but we used it as a chance to adopt rucio mover
    
    We can use rucio mover for reading and this is preferred for us.
    
    We cannot use rucio mover for writing to storage
    
    rucio mover would not honor the lan_write configuration in AGIS, and wan_write does not work from the compute nodes
    
    Even if it had worked, the PFNs probably could not have been registered, as was the case when we tried the xrootd mover: the PFN contains the .local domain rather than the atlas-swt2.org domain
    
    We have moved back to LSM on the writes for now.
    
    We also discovered an issue with the xrdadler32 command from xrootd that shows up during writes. It affects the xrootd site mover and probably the rucio mover. LSM avoids the issue.
    
    Completed the change out of UPS batteries
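
Regarding the xrdadler32 item above: the notes do not say what the issue is, so the following is only a generic way to compute an Adler-32 checksum independently, using Python's standard `zlib.adler32`, for comparison against whatever a transfer tool reports. The function name and chunk size are illustrative choices.

```python
import zlib


def adler32_hex(path, chunk_size=1024 * 1024):
    """Compute the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"
```

Usage would be `adler32_hex("/path/to/file")`, with the result compared against the checksum recorded for the same file elsewhere.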
    
    OU:
    
    - Nothing to report, site running well.
    
    - Need HS06 values for Gold 6230 CPUs.
    
    - Having some xrootd issues with Third-Party-Copy stress tests, following up with experts.
- 14:00 → 14:05
  
  WBS 2.3.3 HPC Operations 5m
  
  Speakers: Doug Benjamin (Duke University (US)), Marc Gabriel Weinberg (University of Chicago (US))
  
  GPU Deployment
  
  https://twiki.cern.ch/twiki/bin/view/AtlasComputing/GpuDeployment
- 14:05 → 14:20
  WBS 2.3.4 Analysis Facilities
  
  Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:05
    
    Analysis Facilities - BNL 5m
    
    Speaker: William Strecker-Kellogg (Brookhaven National Lab)
  - 14:10
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    
    Presentation at SLAC ATLAS group meeting to push for Jupyter
    
    https://docs.google.com/presentation/d/1B9Xiwk9VwcUqNPxjTrVNwqFoT2UzRutpvn6eSvoJX1w/edit?usp=sharing
  - 14:15
    
    ATLAS ML Platform & User Support 5m
    
    Speaker: Ilija Vukotic (University of Chicago (US))
    
    All running smoothly. Mostly used by David Miller and Alexander Bogatskii for hyperparameter scanning of the CLARIANT network for top tagging. Several new users.
- 14:20 → 14:40
  WBS 2.3.5 Continuous Operations
  
  Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))
  - 14:20
    
    US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
    
    Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
    
    US-cloud-summary-2_12_20.pdf
    
    US-cloud-summary-2_19_20.pdf
  - 14:25
    
    Analytics Infrastructure & User Support 5m
    
    Speaker: Ilija Vukotic (University of Chicago (US))
    
    After the ES update, everything is working smoothly. Need to define default apps in Kibana for different spaces.
    
    Helping Ivan move to the DPA space.
    
    Helping Maria with the data popularity project and Petya with Perfsonar data.
    
    Helping Nikolai H with xcache reported data.
    
    Some issues with Perfsonar data replay from tape.
    
    Should work on site specific dashboards.
  - 14:30
    
    Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
    
    Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    
    Changes to how RUCIO presents the VP service to Jedi are now in production and passing my tests.
    
    Jedi logs now show no VP activity, even though VP jobs are arriving at both AGLT2 and Prague2. None are arriving at MWT2 because our ANALY queue is offline.
    
    Created ANALY_MWT2_VP and now trying to get jobs to come to it; these jobs should read through XCache and write out to AGLT2.
- 14:40 → 14:45
  
  AOB 5m