US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
Videoconference room
US_ATLAS_Computing_Integration_and_Operations (bi-weekly Facilities meeting)
Extension: 109263008
Owner: Robert William Gardner Jr
    • 13:00 – 13:05
      Top of the Meeting 5m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))

      Please be sure to register for the Facilities meeting at Argonne:

      https://indico.cern.ch/event/766802/

    • 13:10 – 13:15
      ADC news and issues 5m
      Speaker: Xin Zhao (Brookhaven National Laboratory (US))
      • Migration to Harvester/UCORE
        • analy vs. prod PQs with different VOMS proxy roles?
      • Review of the policy of "extra replica of DAOD to US T2s"
        • Some information collected so far: ~20% of DAODs on US T2s are never accessed.
    • 13:20 – 13:25
      Production 5m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:25 – 13:35
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
    • 13:35 – 13:40
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))

      Dark data cleanup at BNL is being followed up in the DDM ops JIRA ticket: https://its.cern.ch/jira/browse/ATLDDMOPS-5465 . After the cleanup a significant amount of leftover data remains (300–400 TB for DATADISK and about 100 TB for SCRATCHDISK), which could be a reporting issue or unreported usage. This needs to be checked on the storage side.

      Independently of the previous point, BNL storage reporting has been stuck since Nov. 15, showing no change at all in the storage numbers for any token since then. This may eventually result in the storage filling up. This was also mentioned in the same ticket, with the BNL team in CC.

      There is a storage reporting consistency issue at MWT2_UC_SCRATCHDISK, with the storage numbers below the Rucio ones. This appears to have started after ~600K files (~90 TB) were deleted on Nov. 8–9, with subsequent transfers filling the freed space.

      The SLACXRD_LOCALGROUPDISK space reporting value dropped a couple of days ago; this is probably just a reporting issue.

    • 13:45 – 13:50
      Networking 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      Working on issues with the OSG/WLCG MaDDash instance:  https://psmad.opensciencegrid.org/maddash-webui/

      • There are issues with IPv6 (dual-stack) nodes because of an underlying library that MaDDash depends upon. The perfSONAR developers are aware of the issue.
      • Currently there are cases where "grey" boxes indicate no data even though data actually exists. Most are due to the IPv6 issue, but in some cases there may be firewall issues.

      The PWA (pSConfig GUI) at https://psconfig.opensciencegrid.org has some issues getting all the hosts published in OIM and GOCDB. We are working on tracking down the problem in the code on GitHub: https://github.com/soichih/gocdb2sls

      We have seen some cases where perfSONAR toolkit deployments have default limits set that prevent testing from working: the toolkits seem to be OK, but test results are not showing up. In some cases this is because of a 10 GB directory-size limit. The file to check on latency nodes is /etc/owamp-server/owamp-server.limits; the value to increase is 'disk=10G'. Increase it to at least 50G (assuming your disk can hold this much).
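The disk-limit check above can be automated with a short script. This is only a sketch: it assumes the limit appears in the limits file as a `disk=<N>G` token, and the sample line below is illustrative, not a real owamp-server configuration.

```python
# Sketch: flag an owamp-server disk limit below the recommended 50G.
# Assumes the limit is written as "disk=<N>G" in the limits file;
# the sample text here is illustrative only.
import re

def disk_limit_gb(limits_text):
    """Return the first disk limit in GB found in the file content, or None."""
    match = re.search(r"\bdisk\s*=\s*(\d+)\s*G\b", limits_text)
    return int(match.group(1)) if match else None

sample = "limit root with disk=10G, bandwidth=0, delete_on_fetch=on"
gb = disk_limit_gb(sample)
if gb is not None and gb < 50:
    print(f"disk limit is {gb}G; increase to at least 50G")
```

In practice the text would be read from /etc/owamp-server/owamp-server.limits on each latency node.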

    • 13:50 – 13:55
      Data delivery and analytics 5m
      Speaker: Ilija Vukotic (University of Chicago (US))
    • 13:55 – 14:30
      Site Reports
      • 13:55
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • BNL FTS issues recently
          • slow transfers from CERN to BNL, solved by raising the priority
          • a wrongly formatted JSON file?
        • The BNL FTS upgrade is planned for after Thanksgiving
        • Preparation for moving the prod PQs to UCORE
          • John's script is ready; it adjusts HTCondor accounting group quotas based on the pending jobs in the local queue
          • The JobRouter changes are done.
          • We need to test them, but first we need to agree on the path forward for the analy vs. prod issue
        • Another tape test, after increasing the dCache tape disk buffer, is planned for early December.
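The quota-adjustment idea in John's script could be sketched roughly as below. This is not the actual BNL implementation: the group names, slot total, and proportional-split rule are illustrative assumptions, and the real script would take its pending-job counts from condor_q and push the resulting quotas into the HTCondor configuration.

```python
# Illustrative sketch (not BNL's actual script): recompute HTCondor
# accounting-group quotas in proportion to the pending jobs per group.
# Group names and the slot total are made-up examples.

def compute_quotas(pending_jobs, total_slots):
    """Split total_slots across groups in proportion to pending job counts."""
    total_pending = sum(pending_jobs.values())
    if total_pending == 0:
        # Nothing queued: split the slots evenly.
        share = total_slots // len(pending_jobs)
        return {group: share for group in pending_jobs}
    return {group: total_slots * n // total_pending
            for group, n in pending_jobs.items()}

# In the real script the pending counts would come from condor_q, and the
# result would be written into GROUP_QUOTA settings before a reconfig.
pending = {"group_atlas.prod": 3000, "group_atlas.analy": 1000}
print(compute_quotas(pending, total_slots=8000))
# → {'group_atlas.prod': 6000, 'group_atlas.analy': 2000}
```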
      • 14:00
        AGLT2 5m
        Speakers: Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)

        No tickets/incidents

        We finished upgrading the slave PostgreSQL database for the dCache head node from SL6 to CentOS 7. ZFS is used to host the PostgreSQL database, and we upgraded PostgreSQL from 10.5 to 10.6 on both the host and slave nodes.

        We built the OpenAFS 1.8.2 RPMs on the CentOS 7.5 node. The new OpenAFS client (1.8.2) is running well on the CentOS 7.5 node; we plan to test it on the SL6/7 nodes.

      • 14:05
        MWT2 5m
        Speaker: Judith Lorraine Stephen (University of Chicago (US))

        UC:

        • Upgraded analytics cluster to Elasticsearch 6.5.0
        • Both prod and analy hospital queues set up and running jobs

        IU:

        • Continued progress on the SL7 worker migration

        UIUC:

        • Monthly PM: GPFS client updated
      • 14:10
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))
      • 14:15
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        1) Generally stable operations the past two weeks (production, data transfers, user analysis).

        2) A final reconfiguration of the routing setup for SWT2 was successfully implemented.

        3) Planning for the next hardware procurement.

    • 14:30 – 14:35
      AOB 5m