US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 - 13:05
      Top of the Meeting 5m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
      • WBS 2.3 re-shuffling ongoing
      • Tier2 computing review requested by management, still being organized.  
      • Facility workshop at the end of this year:
        • https://doodle.com/poll/hzx53r5qm8769ux7 location TBD
        • 3 days possibly with ADC attendance 
      • Preliminary list of FY19 milestones https://docs.google.com/spreadsheets/d/1MxVS8T47znFhzPyBtIV8nO1Hod-rvUSdxGO35HWKqcU/edit?usp=sharing
      • Deletion issues at BU, UTA: how can this issue be better handled centrally by us?
      • SLATE deployment? Positive feedback from ITD security at BNL
      • Pilot role needs to be enabled for RW on US storage
    • 13:15 - 13:20
      ADC news and issues 5m
      Speaker: Xin Zhao (Brookhaven National Laboratory (US))
      • problem with the new version of the globus RPM, which requires a newer TLS version and breaks BestMan
        • more info here https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMeetingWeek180924
        • moving away from BestMan
      • addition of gridftp test to SAM, as part of ATLAS_CRITICAL 
        • more info here : https://indico.cern.ch/event/755513/contributions/3131553/attachments/1717584/2771546/adc_weekly_sam.pdf
        • Is ANLASC the only US ATLAS gridftp-only site in this test? (a hand-run endpoint check is sketched at the end of this section)
      • A more general ADC question : "do we want to increase/improve the test in ETF/SAM, or do we want to have other external sources to publish results in the result DB and use (also) these to monitor the real usability of the sites?"
      • Harvester migration is happening in other clouds; the US will be on the list after the S&C week
        • more info here : https://docs.google.com/presentation/d/19Bp_EpcwZM4hNZtqE9bjgdIAq-IPDhglHTdifIUIzkk/edit#slide=id.p
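      • As a rough illustration (not part of the official SAM probe), a gridftp endpoint can be hand-checked with the standard Globus client tools; the hostname and paths below are placeholders, not a real US ATLAS endpoint:

        voms-proxy-init -voms atlas                                          # obtain an ATLAS VOMS proxy
        globus-url-copy -list gsiftp://gridftp.example.org:2811/atlas/       # directory listing over GridFTP
        globus-url-copy file:///tmp/sam-test.txt gsiftp://gridftp.example.org:2811/atlas/scratch/sam-test.txt   # simple write test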
    • 13:20 - 13:30
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      BrianL out until Oct 15. Mátyás Selmeci will be attending the facilities meetings until then.

      globus-gssapi-gsi breakage

      • New version of globus-gssapi-gsi (v13.10-1) in EPEL forces TLS 1.2, breaking Bestman. Workaround: set MIN_TLS_PROTOCOL=TLS1_VERSION_DEPRECATED in /etc/grid-security/gsi.conf (shown below).
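      For reference, the workaround is a one-line change in the GSI configuration quoted above; the surrounding contents of gsi.conf vary per host, and the setting presumably should be reverted once a site has moved off Bestman, since it re-enables TLS versions older than 1.2:

      # /etc/grid-security/gsi.conf -- relax the minimum TLS version so that
      # Bestman's older clients can still negotiate a connection
      MIN_TLS_PROTOCOL=TLS1_VERSION_DEPRECATED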

      OSG 3.4.18

      Release tomorrow (9/27).

      • CVMFS 2.5.1
      • XRootD 4.8.4 with HTTP support, fixes for xrootd-lcmaps and xrootd-hdfs
      • HTCondor-CE bug fixes
      • Updating globus-gridftp-server packages to match the EPEL versions
      • gratia-probe bug fixes for slurm and condor probes
      • RSV bug fix for bogus "GRACC server not responding" warnings

      OSG 3.4.19

      • autopyfactory 2.4.9
      • gsi-openssh update

      XRootD Overhaul

      • JIRA Epic
      • We are using the StashCache meeting (Thursdays, 1pm Central) to coordinate OSG XCache documentation for ATLAS/CMS/StashCache

      OSG Topology (formerly OIM)

    • 13:25 - 13:30
      Production 5m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:30 - 13:35
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:40 - 13:45
      Networking 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      Work on the HEP blueprinting effort with ESnet and CMS is ongoing.  The next step is to have two sites on each side of the transatlantic link to compare traffic estimates between ESnet and the experiments.  For ATLAS we have identified AGLT2 and NET2 on the North America side and IN2P3-LPSC and WUPPERTAL on the EU side.  CMS is using:

      T2_US_Nebraska

      T2_US_UCSD

      CIEMAT-LCG2 (Spain)

      INFN-Roma (Italy)

      Response from ESnet:

      Hello Garhan, Shawn, all,

      I have completed the first run for the given sites for the month of Aug 2018. The results can be seen here: https://docs.google.com/spreadsheets/d/1o78o_SujmZ3TtnMnSQFy1aVZ1DR7ODth0Qa7-RNf9l4/edit?usp=sharing

      On the first page of the spreadsheet I have summarized the prefixes used.

      Two quick observations:

      - Interestingly, I don’t see any traffic from or to NET2 - probably not what we expected.

      - There is a lot of traffic from IN2P3-LPSC to IN2P3-LPSC. 

      How does it look to you?

      Best regards,

      Richard

      Is NET2 not using LHCONE?  If so, we may need to switch to SWT2 UTA.

      Shawn

    • 13:45 - 13:50
      Data delivery and analytics 5m
      Speaker: Ilija Vukotic (University of Chicago (US))
    • 13:50 - 13:55
      HPC integration 5m
      Speaker: Doug Benjamin (Duke University (US))
    • 13:55 - 14:30
      Site Reports
      • 13:55
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • Running fine in general 
        • L1TF security patch applied to all CEs. Risk for WNs considered low (a quick mitigation-status check is sketched at the end of this report).
        • dCache Storage server issue 
          • one disk failed, which triggered a restart of the JBOD I/O module
          • caused stage-in errors for jobs, GGUS 137367
        • disk capacity of the dCache tape pools will be increased from 200TB to 2PB by the end of this week, with a target of 5PB by the end of the year
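        • For reference, on a patched EL7 kernel the L1TF mitigation status can be read directly from sysfs on any CE or WN running the new kernel:

          cat /sys/devices/system/cpu/vulnerabilities/l1tf   # e.g. "Mitigation: PTE Inversion" on a patched host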
      • 14:00
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        We have had a lot of trouble with SCORE-dedicated (single-core) WNs since the upgrade to the Foreshadow kernel 3.10.0-862.11.6.el7.x86_64.  This was done simultaneously with an upgrade of HTCondor to 8.6.12.  The symptoms fall into a few different classes:

        The load suddenly starts to increase linearly (this could be an artifact of some cron tasks), but the underlying issue seems to be that a lock of some kind is taken on the file system and never released.  "top" shows no real work going on any longer, so cvmfs and HTCondor appear to be contending for access.  Stopping HTCondor commonly clears the race conditions.  Downgrading to HTCondor 8.4.11 helps, but issues still happen, so the Foreshadow kernel upgrade itself is suspect.

        Another related symptom is that swap usage suddenly surges until all of it is consumed.  No processes seem to be killed by the OOM killer, but again the load rises and no real processing goes on.

        Often the Condor startd goes unresponsive and is killed by the HTCondor watchdogs, but that does not necessarily clear the issue.  HTCondor will sometimes crash out completely and stop, which usually results in an idle WN that must be restarted.

        In a few cases, cvmfs was stuck in a tight loop trying to access disks, or cache, or....  Jakob put out a pre-release of 2.5.2 that addressed this overly aggressive behavior.  This helped but did not resolve the problem, and his final conclusion was that cvmfs itself was not primarily at fault.

        There is no solution to the problem that I'm aware of.  We _could_ back out of the Foreshadow kernel....
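        If we do decide to back the Foreshadow kernel out (with the obvious caveat that it re-opens the L1TF exposure the patch was meant to close), a minimal sketch of what that would look like on an EL7 worker node is below; the older kernel version shown is a placeholder for whatever pre-862.11.6 kernel is still installed on the node:

        uname -r                                        # kernel currently running (3.10.0-862.11.6.el7.x86_64)
        grubby --info=ALL | grep ^kernel                # kernels still installed on the node
        grubby --set-default=/boot/vmlinuz-3.10.0-862.3.2.el7.x86_64   # placeholder: an older, pre-Foreshadow kernel
        reboot                                          # drain HTCondor first, then reboot into the older kernel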

        We had a fiber cut on our private network a week or so back, which dropped half the HTCondor slots out of production.  This was repaired after about 16 hours.  At the same time, two of our internal switches stopped talking to each other, which crashed our VMware infrastructure.

        Beyond this, we have been proceeding along fine.

         

      • 14:05
        MWT2 5m
        Speaker: Judith Lorraine Stephen (University of Chicago (US))

        UC/IU: Nothing new to report

        UIUC: CentOS7 migration

        • Workers and head nodes are all upgraded 
        • Currently running opportunistic jobs only
        • SL7 queues created in AGIS but are still disabled
      • 14:10
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        0. Confirmed that we're on LHCONE and in the ESNet monitoring... https://my.es.net/lhcone/view/NET2/flow

        1. Deletion errors appeared on Monday (same error as SWT2), presumably due to the middleware update that Brian B. told us about on Friday.  We fixed this by switching to atlas-gridftp.bu.edu for deletion.  The fact that this worked bodes well for switching from Bestman2 to GridFTP with DNS and 5 endpoints (a rough sketch of such a setup follows at the end of this report).

        2. We're in the process of retiring the HU queues and absorbing the old equipment into the BU side.  This will simplify operations.  

        3. Still in the process of moving some Harvard users from their ancient Tier 3 to NET3 (BU/Harvard/UMASS).

        4. Preparing a networking plan to connect to NESE storage. 

        5. Getting quotes from DELL.  Interested to hear what the Rob/Shawn/DELL discussion comes up with.
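
        On the Bestman2-to-GridFTP point in item 1, a minimal sketch of what a DNS-plus-five-endpoints setup could look like is below; the alias, TTL, and addresses are placeholders, not the actual NET2 hosts:

        ; round-robin A records publishing one logical GridFTP alias over five door hosts
        atlas-gridftp    300  IN  A  192.0.2.11   ; door 1 (placeholder address)
        atlas-gridftp    300  IN  A  192.0.2.12   ; door 2
        atlas-gridftp    300  IN  A  192.0.2.13   ; door 3
        atlas-gridftp    300  IN  A  192.0.2.14   ; door 4
        atlas-gridftp    300  IN  A  192.0.2.15   ; door 5

        Clients resolving the alias are spread across the doors by resolver rotation, and a door can be pulled from service by removing its record; the short TTL keeps that reasonably responsive.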

      • 14:15
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA

        • Bestman issues are causing problems with the deletion service at both clusters
          • New gridFTP service being rolled out.
          • Awaiting new host certificates from UTA OpSec (a quick check of the installed certs is sketched at the end of this report)
        • Network changeover will now occur Sept. 30th (02:00-07:00)
          • UTA will now peer directly with LEARN on our campus
          • Science DMZ traffic is affected by the change
          • Will declare a downtime for the 5-hour window
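        • Once the new host certificates arrive, a quick sanity check of what got installed (the paths are the conventional grid locations and may differ on our hosts):

          openssl x509 -in /etc/grid-security/hostcert.pem -noout -subject -dates   # DN and validity window
          openssl x509 -in /etc/grid-security/hostcert.pem -noout -issuer           # issuing CA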
    • 14:30 - 14:35
      AOB 5m