US ATLAS Computing Integration and Operations

Start: 2018-11-07T13:00:00-05:00
End: 2018-11-07T15:00:00-05:00

Wednesday 7 Nov 2018, 13:00 → 15:00 US/Eastern

Description

Notes and other material available in the US ATLAS Integration Program Twiki

- 13:00 → 13:05
  
  Top of the Meeting 5m
  
  Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
- 13:05 → 13:10
  
  HPC integration 5m
  
  Speaker: Doug Benjamin (Duke University (US))
  
  OLCF slides from Danila
  
  Screen Shot 2018-11-07 at 10.53.00 AM.png
- 13:10 → 13:15
  
  ADC news and issues 5m
  
  Speaker: Xin Zhao (Brookhaven National Laboratory (US))
  
  Preliminary ADC agenda for the next ATLAS S&C week:
  
  https://indico.cern.ch/event/770941/
- 13:20 → 13:25
  
  Production 5m
  
  Speaker: Mark Sosebee (University of Texas at Arlington (US))
  
  US-cloud-summary-10_31_18.pdf
  
  US-cloud-summary-11_7_18.pdf
- 13:25 → 13:35
  
  OSG-LHC 10m
  
  Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
- 13:35 → 13:40
  
  Data Management 5m
  
  Speaker: Armen Vartapetian (University of Texas at Arlington (US))
  
  The dumps for BNL were provided by Hiro. A DDMops JIRA ticket was opened: https://its.cern.ch/jira/browse/ATLDDMOPS-5465 . Consistency checks started running last week and found 1M dark files on the DATADISK and 120k files on the SCRATCHDISK. After the deletions there still appears to be a significant leftover, which could be either a reporting issue or unreported usage.
  
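  Dark data situation (negative numbers mean the storage reports less than the size in Rucio):

  | Site     | DATADISK | SCRATCHDISK | LOCALGROUPDISK |
  |----------|---------:|------------:|---------------:|
  | BNL      |      390 |         110 |             16 |
  | AGLT2    |        9 |           1 |              1 |
  | MWT2     |        9 |           8 |              2 |
  | NET2     |       12 |           5 |              0 |
  | OU_OSCER |     -185 |          -2 |              0 |
  | SWT2     |        3 |           1 |              0 |
  | WT2      |      -81 |          -5 |              1 |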
  
  The IT monitoring team is still working to repopulate the missing data in the DDM Accounting dashboard (SNOW ticket INC1705039, which I opened 2 weeks ago). Right now the storage values for the 8 most recent days are still missing.
  
  There will be a hands-on session on the new DDM dashboard, with the possibility to give feedback on issues we would like to see addressed, on Friday, Nov. 8 at 15:00 CET.
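The sign convention used in the dark data table above can be sketched as follows. This is a minimal illustration, not the consistency-check tooling itself: the function names, the labels, and the example numbers are all hypothetical, and the notes do not state the unit of the table values.

```python
def dark_data(storage_reported, rucio_size):
    """Return storage-reported usage minus the size known to Rucio.

    Positive: the storage holds more than Rucio accounts for (dark data).
    Negative: the storage reports less than the size in Rucio.
    """
    return storage_reported - rucio_size


def classify(diff):
    """Label a difference according to its sign (labels are illustrative)."""
    if diff > 0:
        return "dark"
    if diff < 0:
        return "storage reports less"
    return "consistent"


# Hypothetical inputs: (storage-reported usage, size in Rucio), same unit.
examples = {
    "SITE_A": (1390, 1000),
    "SITE_B": (815, 1000),
    "SITE_C": (500, 500),
}

report = {
    site: (dark_data(s, r), classify(dark_data(s, r)))
    for site, (s, r) in examples.items()
}

for site, (diff, label) in report.items():
    print(f"{site}: {diff:+d} ({label})")
```

A negative entry does not by itself indicate lost files; as noted above, it may equally reflect a reporting issue on the storage side.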
- 13:45 → 13:50
  Networking 5m
  
  Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
  Lots of meetings:
  
  LHCONE/LHCOPN https://indico.cern.ch/event/725706/
  
  IRIS-HEP https://indico.cern.ch/event/755573/timetable/
  
  OSG Retreat https://indico.fnal.gov/event/18117/timetable/#20181107
  
  All of these have network discussions and presentations. In addition, there was a perfSONAR face-to-face meeting in Orlando two weeks ago (still no URL for the presentations).
  
  There are some known issues with MaDDash that cause our meshes to appear to have less data than they actually do. We are working on getting fixes into the perfSONAR developers' timeline.
  
  Looking for input and collaboration on the HEPiX network function virtualization effort. See details in presentation at: https://indico.cern.ch/event/725706/contributions/3169183/attachments/1744902/2824548/HEPiX_Network_Functions_Virtualisation_Working_Group_F2F_Meeting.pdf
  
  Shawn
- 13:50 → 13:55
  
  Data delivery and analytics 5m
  
  Speaker: Ilija Vukotic (University of Chicago (US))
  
  XCache at US scale
- 13:55 → 14:30
  Site Reports
  - 13:55
    BNL 5m
    
    Speaker: Xin Zhao (Brookhaven National Laboratory (US))
    
    We have a script (from John H.) that looks at the ratio of idle jobs in the queue and adjusts the group quota spillover flags accordingly in the local HTCondor pool. We need this in order to move to the UPQ setup.
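The idea behind such a script can be sketched roughly as below. This is not John H.'s script: the group names, the 0.5 threshold, and the rule that a group with a high idle ratio gets its spillover flag enabled are all assumptions made for illustration, and the sketch only computes the flags without touching any HTCondor configuration.

```python
def idle_ratio(idle, running):
    """Fraction of a group's jobs that are idle; 0.0 if the group has no jobs."""
    total = idle + running
    return idle / total if total else 0.0


def spillover_flags(queue_counts, threshold=0.5):
    """Decide a per-group spillover flag from idle/running job counts.

    Assumed policy (hypothetical): enable spillover for a group when its
    idle ratio is at or above the threshold, i.e. it has unmet demand.
    """
    return {
        group: idle_ratio(c["idle"], c["running"]) >= threshold
        for group, c in queue_counts.items()
    }


# Hypothetical snapshot of the local pool.
counts = {
    "group_a": {"idle": 80, "running": 20},
    "group_b": {"idle": 5, "running": 95},
    "group_c": {"idle": 0, "running": 0},
}

flags = spillover_flags(counts)
for group, enabled in flags.items():
    print(f"{group}: spillover {'on' if enabled else 'off'}")
```

In a real deployment the counts would come from the pool's queue state and the resulting flags would be written back to the group quota configuration; both steps are left out here because the notes do not describe them.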
  - 14:00
    
    AGLT2 5m
    
    Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    
    The PostgreSQL slave server for the dCache head node lost its hard drive. We had to rebuild the machine with 2 new hard disks, and we are in the process of restoring the slave database and setting up SRMwatch for dCache. We also took this chance to upgrade the host from SLC6 to CentOS 7 and to use ZFS to host the PostgreSQL database.
    
    Some of our blade worker nodes show unusually high load (over 200) without any jobs running, and the load goes down when we turn off HTCondor on the node. Some of the affected nodes have disk issues and some do not. We do not yet understand the situation. We updated a few worker nodes from 8.4.11 to 8.6.12 and gave Brian Lin access to them to debug.
    
    Ref ticket:
    
    https://support.opensciencegrid.org/helpdesk/tickets/7720
    
    We upgraded dCache from version 4.2.6 to 4.2.14 and took the chance to upgrade the OS and firmware too. The upgrade did not go well on some of the MSU pool nodes: after upgrading the dCache rpms, the dCache service would not start on the pool nodes. The head node thought a pool instance was already running on the pool node, and there were lock files in the dCache pool. We tried various things, including updating/rebooting the zookeepers and restarting the dCache services on the head/door nodes; what eventually fixed the problem was reinstalling the dCache rpm on the pool nodes.
    
    While we were retiring one storage shelf from one of the UM dCache pool nodes, the wrong virtual disk was accidentally deleted. We could not recover the vdisk and hence lost over 85K files. We opened a JIRA ticket for this and reported the lost files.
    
    -Wenjing
  - 14:05
    MWT2 5m
    
    Speaker: Judith Lorraine Stephen (University of Chicago (US))
    
    Overall the site has been running without issues.
    
    UC:
    
    working with Ilija and Stephane to set up hospital queue
    
    analysis jobs issue on the rebuilt sl7 nodes caused by missing software; now resolved
    
    IU: sl7 migration in progress
    
    UIUC: nothing new to report
  - 14:10
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
  - 14:15
    
    SWT2 5m
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    OU:
    
    OU compute nodes are apparently failing over from the OU Frontier Squid to others; we are investigating. We increased the squid cache to 100 GB and do not see any obvious errors, but the failovers continue.
    
    There was a network issue at OneNet in Tulsa, starting Nov 2, which caused transfer failures and slower transfer speeds. It was fixed last night, and everything is back to normal now.
    
    UTA:
    
    DMZ rework is now complete.
- 14:30 → 14:35
  
  AOB 5m