US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
Videoconference Rooms
US_ATLAS_Computing_Integration_and_Operations
Description
Bi-weekly Facilities meeting
Extension
109263008
Owner
Robert William Gardner Jr
    • 13:00 13:05
      Top of the Meeting 5m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:05 13:10
      HPC integration 5m
      Speaker: Doug Benjamin (Duke University (US))
    • 13:10 13:15
      ADC news and issues 5m
      Speaker: Xin Zhao (Brookhaven National Laboratory (US))

      Preliminary ADC agenda for the next ATLAS S&C week:

      https://indico.cern.ch/event/770941/

    • 13:20 13:25
      Production 5m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:25 13:35
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
    • 13:35 13:40
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))

      The dumps for BNL were provided by Hiro. A DDMops JIRA ticket was opened: https://its.cern.ch/jira/browse/ATLDDMOPS-5465 . Consistency checks started running last week and found 1M dark files on the DATADISK and 120k files on the SCRATCHDISK. After the deletions there still appears to be a significant leftover, which could be either a reporting issue or unreported usage.
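
      As a rough illustration of what such a consistency check does, the sketch below compares a storage dump with a catalog listing using set differences. It is not the actual DDM tooling: the function name and the inputs are hypothetical, and a real check also has to account for files being written or deleted while the dump is taken, which is ignored here.

```python
def compare_dump_to_catalog(storage_dump, catalog_replicas):
    """Compare a storage dump with the catalog's replica list.

    Both arguments are iterables of file paths (hypothetical inputs).
    Returns (dark, lost):
      dark - files present on storage but unknown to the catalog
      lost - files known to the catalog but missing on storage
    """
    on_storage = set(storage_dump)
    in_catalog = set(catalog_replicas)
    dark = sorted(on_storage - in_catalog)
    lost = sorted(in_catalog - on_storage)
    return dark, lost


if __name__ == "__main__":
    # Made-up paths, for illustration only.
    dump = ["/data/a", "/data/b", "/data/c"]
    catalog = ["/data/b", "/data/c", "/data/d"]
    dark, lost = compare_dump_to_catalog(dump, catalog)
    print("dark:", dark)
    print("lost:", lost)
```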

      Dark data situation (negative numbers mean the storage reports less than the size recorded in Rucio):

      Site        DATADISK  SCRATCHDISK  LOCALGROUPDISK
      BNL              390          110              16
      AGLT2              9            1               1
      MWT2               9            8               2
      NET2              12            5               0
      OU_OSCER        -185           -2               0
      SWT2               3            1               0
      WT2              -81           -5               1
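
      To make the sign convention concrete, here is a small sketch that lists the endpoints where the storage reports less than Rucio. The numbers are copied from the table above, in whatever unit the notes use; the dictionary layout and the function name are assumptions made for this example.

```python
# Values copied from the dark data table; a negative number means the
# storage reports less than the size recorded in Rucio.
DARK_DATA = {
    "BNL": {"DATADISK": 390, "SCRATCHDISK": 110, "LOCALGROUPDISK": 16},
    "AGLT2": {"DATADISK": 9, "SCRATCHDISK": 1, "LOCALGROUPDISK": 1},
    "MWT2": {"DATADISK": 9, "SCRATCHDISK": 8, "LOCALGROUPDISK": 2},
    "NET2": {"DATADISK": 12, "SCRATCHDISK": 5, "LOCALGROUPDISK": 0},
    "OU_OSCER": {"DATADISK": -185, "SCRATCHDISK": -2, "LOCALGROUPDISK": 0},
    "SWT2": {"DATADISK": 3, "SCRATCHDISK": 1, "LOCALGROUPDISK": 0},
    "WT2": {"DATADISK": -81, "SCRATCHDISK": -5, "LOCALGROUPDISK": 1},
}


def under_reporting(table):
    """Return (site, endpoint, value) tuples where the value is negative."""
    return [
        (site, endpoint, value)
        for site, endpoints in table.items()
        for endpoint, value in endpoints.items()
        if value < 0
    ]


if __name__ == "__main__":
    for site, endpoint, value in under_reporting(DARK_DATA):
        print(f"{site} {endpoint}: {value}")
```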

      The IT monitoring team is still working to repopulate the missing data in the DDM Accounting dashboard (SNOW ticket INC1705039, which I opened 2 weeks ago). Right now the storage values for 8 recent days are still missing.

      There will be a hands-on session on the new DDM dashboard, with the possibility to give feedback on issues we would like to see addressed, on Friday, Nov. 8 at 15:00 CET.

    • 13:45 13:50
      Networking 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      Lots of meetings:

      • LHCONE/LHCOPN  https://indico.cern.ch/event/725706/
      • IRIS-HEP https://indico.cern.ch/event/755573/timetable/
      • OSG Retreat https://indico.fnal.gov/event/18117/timetable/#20181107 

      All of these have network discussions and presentations. In addition, there was a perfSONAR face-to-face meeting in Orlando two weeks ago (still no URL for the presentations).

      There are some known issues with MaDDash that cause our meshes to appear to have less data than they actually do. We are working on getting fixes into the perfSONAR developers' timeline.

      Looking for input and collaboration on the HEPiX network function virtualization effort.  See details in presentation at:  https://indico.cern.ch/event/725706/contributions/3169183/attachments/1744902/2824548/HEPiX_Network_Functions_Virtualisation_Working_Group_F2F_Meeting.pdf 

      Shawn

       

    • 13:50 13:55
      Data delivery and analytics 5m
      Speaker: Ilija Vukotic (University of Chicago (US))
    • 13:55 14:30
      Site Reports
      • 13:55
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • We have a script (from John H.) that looks at the ratio of idle jobs in the queue and adjusts the group quota spillover flags accordingly in the local HTCondor pool. We need this in order to move to the UPQ setup.
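
        The logic of such a script might look roughly like the sketch below. This is not John H.'s script: the function name, the threshold, and the policy (enable spillover for a group when its fraction of idle jobs is above the threshold) are assumptions made for illustration. How the resulting flags are applied to the HTCondor configuration is not stated in the notes and is left out.

```python
def spillover_flags(queue_counts, idle_threshold=0.5):
    """Decide a spillover flag per group from its idle-job ratio.

    queue_counts maps group name -> (idle_jobs, running_jobs).
    idle_threshold is a hypothetical tuning value.
    Returns a dict mapping group name -> bool (True = allow spillover).
    """
    flags = {}
    for group, (idle, running) in queue_counts.items():
        total = idle + running
        # An empty group has no demand, so treat its idle ratio as zero.
        idle_ratio = idle / total if total else 0.0
        # Assumed policy: a group with much unmet demand may use spare slots.
        flags[group] = idle_ratio > idle_threshold
    return flags


if __name__ == "__main__":
    # Made-up group names and job counts, for illustration only.
    counts = {"group_a": (800, 200), "group_b": (10, 90), "group_c": (0, 0)}
    print(spillover_flags(counts))
```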
      • 14:00
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        The PostgreSQL slave server for the dCache head node lost its hard drive. We had to rebuild the machine with 2 new hard disks, and we are now restoring the slave database and setting up SRMwatch for dCache. We also took this chance to upgrade the host from SLC6 to CentOS 7 and to use ZFS to host the PostgreSQL database.

        Some of our blade worker nodes show unusually high load (over 200) without any jobs running, and the load goes down when we turn off HTCondor on the node. Some of the affected nodes have disk issues, but others do not. We do not yet understand the situation. We updated HTCondor on a few worker nodes from 8.4.11 to 8.6.12 and gave Brian Lin access to these nodes to debug.

        Ref ticket:

        https://support.opensciencegrid.org/helpdesk/tickets/7720 

        We upgraded dCache from version 4.2.6 to 4.2.14 and took the chance to upgrade the OS and firmware too. The upgrade did not go well on some of the MSU pool nodes. After upgrading the dCache RPMs, the dCache service would not start on the pool nodes; the head node thought a pool instance was already running on the pool node, and there were lock files in the dCache pool. We tried various things, including updating/rebooting the ZooKeepers and restarting the dCache services on the head/door nodes. What eventually fixed the problem was reinstalling the dCache RPM on the pool nodes.

        While retiring one storage shelf from one of the UM dCache pool nodes, the wrong virtual disk was accidentally deleted. We could not recover the vdisk and hence lost over 85K files. We opened a JIRA ticket and reported the lost files.

        -Wenjing

         

         

      • 14:05
        MWT2 5m
        Speaker: Judith Lorraine Stephen (University of Chicago (US))

        Overall the site has been running without issues.

        UC:

        • working with Ilija and Stephane to set up hospital queue
        • analysis jobs issue on the rebuilt sl7 nodes caused by missing software; now resolved 

        IU: sl7 migration in progress

        UIUC: nothing new to report

      • 14:10
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))
      • 14:15
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU:

        OU compute nodes are apparently failing over from the OU Frontier Squid to others; we are investigating. We increased the squid cache to 100 GB and don't see any obvious errors, but the failovers continue.

        There was a network issue at OneNet in Tulsa, starting Nov 2, which caused transfer failures and slower transfer speeds. It was fixed last night and everything is back to normal now.

        UTA:

        DMZ rework is now complete. 

    • 14:30 14:35
      AOB 5m