US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
Videoconference room
US_ATLAS_Computing_Integration_and_Operations (bi-weekly Facilities meeting)
Extension: 109263008
Owner: Robert William Gardner Jr
    • 13:00 – 13:05
      Top of the Meeting 5m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))

      Please be sure to register for the Facilities meeting at Argonne:

      https://indico.cern.ch/event/766802/

    • 13:10 – 13:15
      ADC news and issues 5m
      Speaker: Xin Zhao (Brookhaven National Laboratory (US))
      • Migration to Harvester/UCORE
        • analy vs. prod PQs with different VOMS proxy roles?
      • Review of the policy of "extra replica of DAOD to US T2s"
        • Some information collected so far: ~20% of DAODs on US T2s are never accessed.
    • 13:20 – 13:25
      Production 5m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:25 – 13:35
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
    • 13:35 – 13:40
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))

      Dark data cleanup at BNL is being followed up in the DDM ops JIRA ticket: https://its.cern.ch/jira/browse/ATLDDMOPS-5465 . After the cleanup a significant amount of leftover data remains (300–400 TB for DATADISK and about 100 TB for SCRATCHDISK), which could be a reporting issue or unreported usage. This needs to be checked on the storage side.

      Independently of the previous point, BNL storage reporting has been stuck since Nov. 15, showing no change at all in the storage numbers for any token since then. This may eventually result in the storage filling up. This was also mentioned in the same ticket, with the BNL team in CC.

      There is a storage reporting consistency issue at MWT2_UC_SCRATCHDISK, with the storage numbers below the Rucio ones. This appears to have started after ~600K files (~90 TB) were deleted on Nov. 8–9, with subsequent transfers filling the freed space.

      The SLACXRD_LOCALGROUPDISK space reporting value dropped a couple of days ago; this is probably just a reporting issue.

    • 13:45 – 13:50
      Networking 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      Working on issues with the OSG/WLCG MaDDash instance:  https://psmad.opensciencegrid.org/maddash-webui/

      • There are issues with IPv6 (dual-stack) nodes because of an underlying library that MaDDash depends upon. The perfSONAR developers are aware of the issue.
      • Currently there are cases where "grey" boxes indicate no data even though data actually exists. Most are due to the IPv6 issue, but in some cases there may be firewall issues.

      The PWA (pSConfig GUI) at https://psconfig.opensciencegrid.org has some issues getting all the hosts published in OIM and GOCDB. We are working on tracking down the problem in the code on GitHub: https://github.com/soichih/gocdb2sls

      We have seen some cases where perfSONAR toolkit deployments have default limits set that prevent testing from working: the toolkits seem to be OK, but test results are not showing up. In some cases this is because of a 10 GB directory-size limit. The file to check on latency nodes is /etc/owamp-server/owamp-server.limits; the value to increase is 'disk=10G'. Increase it to at least 50G (assuming your disk can hold this much).
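The disk-limit check above can be automated with a short script. This is only a sketch: it assumes the limit appears in the limits file as a `disk=<N>G` token, and the sample line below is illustrative, not a real owamp-server configuration.

```python
# Sketch: flag an owamp-server disk limit below the recommended 50G.
# Assumes the limit is written as "disk=<N>G" in the limits file;
# the sample text here is illustrative only.
import re

def disk_limit_gb(limits_text):
    """Return the first disk limit in GB found in the file content, or None."""
    match = re.search(r"\bdisk\s*=\s*(\d+)\s*G\b", limits_text)
    return int(match.group(1)) if match else None

sample = "limit root with disk=10G, bandwidth=0, delete_on_fetch=on"
gb = disk_limit_gb(sample)
if gb is not None and gb < 50:
    print(f"disk limit is {gb}G; increase to at least 50G")
```

In practice the text would be read from /etc/owamp-server/owamp-server.limits on each latency node.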

    • 13:50 – 13:55
      Data delivery and analytics 5m
      Speaker: Ilija Vukotic (University of Chicago (US))
    • 13:55 – 14:30
      Site Reports
      • 13:55
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • BNL FTS issues recently
          • slow transfers from CERN to BNL, solved by raising the priority
          • a wrongly formatted JSON file?
        • The BNL FTS upgrade is planned for after Thanksgiving
        • Preparation for moving the prod PQs to UCORE
          • John's script is ready; it adjusts HTCondor accounting group quotas based on the pending jobs in the local queue
          • The JobRouter changes are done.
          • We need to test them, but first we need to agree on the path forward for the analy vs. prod issue
        • Another tape test, after increasing the dCache tape disk buffer, is planned for early December.
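The quota-adjustment idea in John's script could be sketched roughly as below. This is not the actual BNL implementation: the group names, slot total, and proportional-split rule are illustrative assumptions, and the real script would take its pending-job counts from condor_q and push the resulting quotas into the HTCondor configuration.

```python
# Illustrative sketch (not BNL's actual script): recompute HTCondor
# accounting-group quotas in proportion to the pending jobs per group.
# Group names and the slot total are made-up examples.

def compute_quotas(pending_jobs, total_slots):
    """Split total_slots across groups in proportion to pending job counts."""
    total_pending = sum(pending_jobs.values())
    if total_pending == 0:
        # Nothing queued: split the slots evenly.
        share = total_slots // len(pending_jobs)
        return {group: share for group in pending_jobs}
    return {group: total_slots * n // total_pending
            for group, n in pending_jobs.items()}

# In the real script the pending counts would come from condor_q, and the
# result would be written into GROUP_QUOTA settings before a reconfig.
pending = {"group_atlas.prod": 3000, "group_atlas.analy": 1000}
print(compute_quotas(pending, total_slots=8000))
# → {'group_atlas.prod': 6000, 'group_atlas.analy': 2000}
```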
      • 14:00
        AGLT2 5m
        Speakers: Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        No tickets/incidents

        We finished upgrading the slave PostgreSQL database for the dCache head node from SL6 to CentOS 7. ZFS is used to host the PostgreSQL database, and we upgraded PostgreSQL from 10.5 to 10.6 on both the host and slave nodes.

        We built the OpenAFS 1.8.2 RPMs on the CentOS 7.5 node. The new OpenAFS client (1.8.2) is running well on the CentOS 7.5 node; we plan to test it on the SL6/7 nodes.

      • 14:05
        MWT2 5m
        Speaker: Judith Lorraine Stephen (University of Chicago (US))

        UC:

        • Upgraded analytics cluster to Elasticsearch 6.5.0
        • Both prod and analy hospital queues set up and running jobs

        IU:

        • Continued progress on the SL7 worker migration

        UIUC:

        • Monthly PM: GPFS client updated
      • 14:10
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))
      • 14:15
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        1) Generally stable operations the past two weeks (production, data transfers, user analysis).

        2) A final reconfiguration of the routing setup for SWT2 was successfully implemented.

        3) Planning for the next hardware procurement.

    • 14:30 – 14:35
      AOB 5m