US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2020-01-22T13:00:00-05:00
End: 2020-01-22T14:45:00-05:00
Location: No location set

Wednesday 22 Jan 2020, 13:00 → 14:45 US/Eastern

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
  
  https://docs.google.com/document/d/1NIc67p3AB2RkYjJsP6Nx_lwPXFX03w1n2SFOgCU47ro/edit
- 13:10 → 13:20
  OSG-LHC 10m
  
  Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
  Documentation for OSG XRootD standalone, the GridFTP replacement, is live: https://opensciencegrid.org/docs/data/xrootd/install-standalone/
  
  HTTP/S enabled by default
  
  Supports HTTPS third-party copy
  
  Meeting notes:
  
  SWT2 and NET2 are interested in testing xrootd-https; Xin/Tier1 already is.
  
  Doug asked about RHEL8 for OSG and the timeframe; the OLCF decision on a VM for Harvester is coming up. Also, python3 as default? (Brian thinks yes.)
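For sites planning to test xrootd-https, the shape of an HTTPS third-party-copy request can be sketched as follows. This is a minimal illustration that assumes the WLCG HTTP-TPC pull-mode convention (a `COPY` request sent to the destination with a `Source` header, and `TransferHeader`-prefixed headers forwarded to the source). The endpoint URLs, the token value, and the helper name `build_pull_copy` are hypothetical, and the request is only built, never sent.

```python
from urllib.request import Request


def build_pull_copy(dest_url, source_url, source_token=None):
    """Build (but do not send) an HTTP-TPC pull-mode COPY request.

    The request targets the destination endpoint, which is asked to
    fetch the file from source_url.
    """
    headers = {"Source": source_url}
    if source_token is not None:
        # Assumption: headers prefixed with "TransferHeader" are forwarded
        # by the destination to the source with the prefix removed.
        headers["TransferHeaderAuthorization"] = f"Bearer {source_token}"
    return Request(dest_url, headers=headers, method="COPY")


# Hypothetical endpoints and token, for illustration only.
req = build_pull_copy(
    "https://dest.example.org:1094/atlas/file.root",
    "https://src.example.org:1094/atlas/file.root",
    source_token="abc",
)
print(req.get_method(), req.full_url)
```

Building the request without sending it keeps the sketch safe to run anywhere; an actual test would also need valid credentials for both endpoints.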
- 13:20 → 13:35
  
  Topical Report
  
  Convener: Robert William Gardner Jr (University of Chicago (US))
- 13:35 → 13:40
  Tier1 Center 5m
  
  Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
  Full Run 2 reprocessing is ongoing; for BNL: ~1.3M files, 2.9 PB to stage from tape.
  
  Slow deletion on DATADISK
  
  GGUS 144845
  
  The cleaner has been running fine since the dCache upgrade, but this time DOMA HTTP TPC tests were also running at the same time. An external script is being used to help speed up the release of deleted space, ~4 PB.
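As a rough sanity check on the BNL staging figures reported above, the average file size works out to roughly 2.2 GB. The sketch below assumes decimal units (1 PB = 1e6 GB) and treats the quoted numbers as exact; the function name is illustrative only.

```python
def average_file_size_gb(total_petabytes, n_files):
    """Average file size in GB, assuming decimal units (1 PB = 1e6 GB)."""
    return total_petabytes * 1e6 / n_files


# Figures quoted in the Tier1 report: ~1.3M files, 2.9 PB.
avg = average_file_size_gb(2.9, 1_300_000)
print(f"{avg:.2f} GB per file")
```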
- 13:40 → 14:00
  Tier2 Centers
  
  Convener: Shawn Mc Kee (University of Michigan (US))
  - 13:40
    
    AGLT2 5m
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
    
        Hardware:
    
        - No big changes, no major issues, typical maintenance.
    
        - Progress continues on retiring older T2 storage at MSU and T3 storage at UM.
    
        Services:
    
        - One new GGUS ticket (144783) about jobs losing heartbeat. We verified this at the site.
          The number of jobs losing heartbeat has been consistent at the site, about 100-200 jobs per day.
          The symptoms also seem similar to those seen at other sites (see MWT2 ticket 144756)
          and were tentatively tracked down to the pilot, with a fix recently put in place.
    
        - Condor problem: on Jan 21st, starting around 4am, the number of running jobs in Condor started to drop, down to 20%.
          We spent a few hours investigating; eventually, rebooting the Condor central server
          and another Tier 3 submission machine solved the problem.
    
        - Getting close to adding (restoring) the xrootd.aglt2.org SAN to the dCache doors' SSL certificate.
    
    NOTE: Wenjing Wu is on vacation from today for the next two weeks and will then work from China for one week (use a non-Gmail address to reach her: wuwj@ihep.ac.cn or wwu@cern.ch). She will be back on the 17th of February.
  - 13:45
    MWT2 5m
    
    Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
    
    GGUS Tickets:
    
    Ticket 144756 "problems at ANALY_MWT2_UCORE" (Closed)
    
    Jobs stuck in “scouting” status. Pilot stuck in endless monitoring loop
    
    Pilot update pushed that fixed the issue. Jobs no longer getting stuck for days
    
    Ticket 144542 "pilot stage-in issues" (Closed)
    
    No update for a couple of weeks now after our last change. I pinged it last Monday, thinking 144756 was a similar issue. Closed it now that there doesn't seem to be a problem and nobody has commented or complained.
    
    Tickets 144798 & 144808 (Closed)
    
    Duplicate of 144756
    
    Reopened as 144808. We manually evicted a large number of jobs to let new production jobs in, as we weren't sure when a fix would arrive.
    
    Ticket 144840 "MWT2 stage-in issues"
    
    "Auth Failed" errors are popping up on xrootd file downloads. Currently investigating through manual testing and log checks.
    
    UC:
    
    Began network setup, but fell behind trying to get software from the vendor. ETA is next week.
    
    UIUC:
    
    Still waiting on new purchase arrival.
    
    IU:
    
    Ready for IPv6 setup according to network team. Will begin trial setup in the coming weeks.
  - 13:50
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    Smooth operations.
    
    Started our IPv6 journey. BU networking is working on getting addresses; we're preparing to dual-stack the DDM endpoints first.
    
    New NESE endpoint working.
    
    Prep work for adding new DELL NESE storage (6PB raw). Storage arrived. Networking gear still arriving. Still waiting on UPS power to three new racks at MGHPCC.
  - 13:55
    
    SWT2 5m
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    UTA:
    
    An issue with the XrootdFS mount on a GridFTP door caused problems with deletions.
    
    Everything else running well.
    
    OU:
    
    - Nothing to report, site running well.
    
    - There was a brief HC site outage over the weekend, caused by HC jobs being killed by the pilot because they consumed too much RAM. Those HC jobs were stopped again by Petr.
- 14:00 → 14:05
  
  HPC Operations 5m
  
  Speakers: Doug Benjamin (Duke University (US)), Marc Gabriel Weinberg (University of Chicago (US))
- 14:05 → 14:20
  Analysis Facilities
  
  Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:05
    
    Analysis Facilities - BNL 5m
    
    Speaker: William Strecker-Kellogg (Brookhaven National Lab)
    
    BNL T3: ATLAS GPFS filesystem outage for a few hours on Sunday. We think the cluster went down because new (incompatible) kernel modules were built and installed on Friday, which caused stale mounts from 12pm to 6pm on Sunday.
    
    Otherwise normal operations
  - 14:10
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:15
    
    ATLAS ML Platform & User Support 5m
    
    Speaker: Ilija Vukotic (University of Chicago (US))
    
    ML platform running fine. Every now and then a new user comes and needs a bit of help with starting work. No shortage of GPUs now.
    
    Starting work on a Reinforcement Learning OpenAI environment for smarter caching decisions. This experience will be valuable for other use cases.
- 14:20 → 14:40
  Continuous Operations
  
  Convener: Robert William Gardner Jr (University of Chicago (US))
  - 14:20
    
    US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
    
    Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
    
    US-cloud-summary-1_15_20.pdf
    
    US-cloud-summary-1_22_20.pdf
  - 14:25
    
    Analytics Infrastructure & User Support 5m
    
    Speaker: Ilija Vukotic (University of Chicago (US))
    
    New ES nodes should be connected to the cluster next week. ES will be updated at the same time.
    
    We informed people of the pending removal of the "spare" ES cluster. Two people asked for a delay. The new removal date is the 25th.
    
    Slowly replaying perfSONAR data from tape. Still some issues to fix.
    
    Getting meta and status perfSONAR indices into RMQ and tape. Work done on getting ESnet data to follow the same data flow.
    
    Starting work on organizing data annotations.
  - 14:30
    
    Intelligent Data Delivery R&D (co-w/ WBS 2.4.x) 5m
    
    Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    
    XCache was working stably at MWT2, AGLT2, Prague, BNL.
    
    VP changes requested by Rod, Tadashi.
    
    Created XXX_VP_DISK for all 4 sites and "connected" them to ANALY queues at sites.
    
    There are edge cases that need to be addressed, e.g. when the original data copy exists only on tape.
    
    Quite a bit of traffic on all XCaches (> 3Gbps).
    
    Now reporting all requests and replies to/from VPservice to ES so we can monitor it. Need to find a way to label jobs brokered against VP copies; currently it's rather complex to identify them.
    
    ServiceX work: new high-performance transformer, work on Kafka deployment, monitoring, performance characteristics.
- 14:40 → 14:45
  
  AOB 5m
