US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:15
      Top of the Meeting 15m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:15 13:20
      ADC news and issues 5m
      Speakers: Robert Ball (University of Michigan (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

      From Alessandro Di Girolamo:

      As you might remember, we take the “installed capacity” from Rebus to get the average 
      power of each site's logical CPUs; we then use this in the ATLAS job monitoring 
      dashboard to calculate the HS06 provided by each site. With the BDII stopped, those 
      numbers are no longer in Rebus. We have discussed a possible solution, which is to 
      populate Rebus with the numbers from
      
      
      http://myosg.grid.iu.edu/rgsummary/xml?datasource=summary&summary_attrs_showwlcg=on&all_resources=on&gridtype=on&gridtype_1=on&active=on&active_value=1&disable_value=1%22
      
      <HEPSPEC>
      <APELNormalFactor>

       

      Basically, these are the numbers that we enter into our OSG resources for HS06 and APEL Normalization Factors.  If this is going to be used going forward, then each of us has to keep these numbers up to date.

      ====== Are they correct for your site? ============

      Please make sure that they ARE correct, now and going forward.
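      As a sketch of how those two tags could be pulled out of the MyOSG summary feed: the snippet below parses the XML with Python's standard library and collects HEPSPEC and APELNormalFactor per resource. The element layout in the inline sample (a Resource with a Name and a WLCGInformation block) is an assumption for illustration, not a verified schema; check it against the live feed before relying on it.

```python
# Hedged sketch: extract HS06 (HEPSPEC) and APEL normalization factors
# from MyOSG summary XML.  The nesting used in `sample` is assumed.
import xml.etree.ElementTree as ET

def extract_wlcg_numbers(xml_text):
    """Return {resource_name: (hepspec, apel_factor)} from MyOSG summary XML."""
    root = ET.fromstring(xml_text)
    numbers = {}
    for res in root.iter("Resource"):          # every <Resource> element
        name = res.findtext("Name")
        hepspec = res.findtext(".//HEPSPEC")   # search any descendant
        apel = res.findtext(".//APELNormalFactor")
        if name and hepspec:
            numbers[name] = (float(hepspec), float(apel) if apel else None)
    return numbers

# Small illustrative sample; resource name and values are made up.
sample = """<ResourceSummary>
  <ResourceGroup>
    <Resources>
      <Resource>
        <Name>MWT2_CE</Name>
        <WLCGInformation>
          <HEPSPEC>11.5</HEPSPEC>
          <APELNormalFactor>2875</APELNormalFactor>
        </WLCGInformation>
      </Resource>
    </Resources>
  </ResourceGroup>
</ResourceSummary>"""

print(extract_wlcg_numbers(sample))  # -> {'MWT2_CE': (11.5, 2875.0)}
```

      In practice the same function would be fed the body of the myosg.grid.iu.edu URL above after fetching it over HTTP.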

       

    • 13:20 13:30
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:30 13:35
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:35 13:40
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:40 13:45
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:45 13:50
      FAX and Xrootd Caching 5m
      Speakers: Andrew Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 13:50 14:00
      Site movers 10m
      Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

      1. Internal/External xrootd door (doc is still not fully ready; see Mario's comment)

      https://docs.google.com/document/d/1FKbXCHZ-NA__nFlELUpm_D32OOdXGgq786GeEQotOUA/edit#

      2. Site movers: what needs to be done in AGIS:

      • use_newmover=True
      • deprecate_oldmover=True
      • wansinklimit=0 (or None?)

      All production US sites use the new site movers. US sites not using the new site mover ("use_newmover" != True):

      • ANALY_BNL_CLOUD
      • ANALY_BNL_EC2E1
      • ANALY_BNL_EC2W1
      • ANALY_BNL_EC2W2
      • ANALY_BNL_T3-condor
      • ANALY_NERSC
      • ANALY_ORNL_Titan
      • ANALY_TEST-APF-condor
      • ANALY_TEST-APF2-condor
      • ANALY_WISC_ATLAS
      • BNL_CLOUD
      • BNL_CLOUD_MCORE
      • BNL_EC2E1
      • BNL_EC2E1_MCORE
      • BNL_EC2W1
      • BNL_EC2W1_MCORE
      • BNL_EC2W2
      • BNL_EC2W2_MCORE
      • ES_ORNL_Titan
      • GOOGLE_COMPUTE_ENGINE
      • NERSC-PDSF-sge
      • ORNL_Titan_MCORE (does it matter?)
      • TESTGLEXEC
      • TWTEST
      • TestPilot
      • Titan_Harvester_MCORE
      • UMESHTEST
      • WT2_Install

      Sites using new site movers but not deprecating old movers:

      • ANALY_AGLT2_TEST_SL6-condor
      • ANALY_OU_OCHEP_SWT2-condor
      • BNL_LOCAL-condor
      • Lucille_CE
      • Lucille_MCORE
      • NERSC_Cori_2
      • OUHEP_OSG
      • OU_OCHEP_SWT2-condor
      • OU_OSCER_ATLAS
      • OU_OSCER_ATLAS_MCORE
      • OU_OSCER_ATLAS_OPP

      Sites using the new site mover but without wansinklimit set to 0 (or None):

      • ANALY_AGLT2_TEST_SL6-condor
      • ANALY_CONNECT
      • ANALY_CONNECT_SHORT
      • ANALY_CONNECT_TEST
      • ANALY_MWT2_HIMEM_MCORE
      • ANALY_MWT2_MCORE
      • ANALY_OU_OCHEP_SWT2-condor
      • NERSC_Cori
      • NERSC_Cori_2
      • NERSC_Edison
      • NERSC_Edison_2
      • SLAC_ES-lsf
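      The three checks above can be expressed as a small audit. The snippet below is a hypothetical sketch: it assumes the per-queue parameters have already been fetched from AGIS into a plain dict, and it mirrors the tiered lists above (a queue is reported only for the first check it fails).

```python
# Hedged sketch of the AGIS flag audit described above.  The input format
# {queue_name: {param: value}} is an assumption for illustration; the
# parameter names match the AGIS fields quoted in the text.
def audit_mover_flags(queues):
    """Return {problem: [queue names]} for the three site-mover checks."""
    problems = {"no_newmover": [], "old_mover_active": [], "wansinklimit": []}
    for name, params in sorted(queues.items()):
        if params.get("use_newmover") is not True:
            problems["no_newmover"].append(name)        # first list above
        elif params.get("deprecate_oldmover") is not True:
            problems["old_mover_active"].append(name)   # second list above
        elif params.get("wansinklimit") not in (0, None):
            problems["wansinklimit"].append(name)       # third list above
    return problems

# Illustrative input; the parameter values here are made up.
example = {
    "BNL_CLOUD": {"use_newmover": False},
    "OU_OSCER_ATLAS": {"use_newmover": True, "deprecate_oldmover": False},
    "ANALY_CONNECT": {"use_newmover": True, "deprecate_oldmover": True,
                      "wansinklimit": 1000},
}
print(audit_mover_flags(example))
```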

       

    • 14:00 14:10
      OS performance testing 10m
      Speaker: Doug Benjamin (Duke University (US))
    • 14:10 14:25
      HPCs integration 15m
      Speaker: Taylor Childers (Argonne National Laboratory (US))
    • 14:25 16:00
      Site Reports
      • 14:25
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • dCache:  
          • slow release of free space: fixed by forcing the dCache cleaner to run more often
          • high memory usage on admin node: fixed by increasing memory on the server
          • IPv6 fully functional now
        • AFS phase-out on Tier1 computing farm:  ongoing, in rolling mode
        • singularity test:  
          • new condor queue and PanDA queues are being tested via HC jobs
          • individual production jobs also succeeded
          • expect to go in production soon
        • AGIS cleanup:  deleted some obsolete BNL PanDA queues
      • 14:30
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        Last week our "OtherVO" gatekeeper suddenly stopped working.  I first noticed this on April 19, when our space-usage.json file had stopped updating as of mid-afternoon on April 18.  Then I noticed that incoming CE jobs were going into Hold and never emerging as real batch-system jobs.  Basically, all services were timing out.

        Chased this hard, even going so far as to totally rebuild that gatekeeper (which is good in the sense that GRAM should now be gone from it entirely).  Nothing worked until I temporarily moved it from the grid.umich.edu domain to the aglt2.org domain, at which point both gfal-copy and condor_ce_ping began to work.

        Subsequently found that UM subscribes to an IPS (Intrusion Prevention System) service that had deployed a complete block on TLS/SSL traffic.  Once our subnet was scanned to their satisfaction it was white-listed, and the gatekeeper then came back online.

        Beware of the possibility of a similar situation at your home universities.  It has also affected the MWT2 and LIGO collaborators at UM.

        A dCache OOM issue hit us when the number of simultaneous SRM transfers climbed above anything we had seen here before (a peak of 18k per 10 minutes was observed a few days ago in srmwatch), in conjunction with dCache 3.0.11 simply using more memory.  We increased the dCacheDomain Java instance memory from 2/4 GB to 4/8 GB and the issue was resolved.
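        For reference, a per-domain memory increase like this would typically go in the dCache layout file. The fragment below is a hedged sketch: dcache.java.memory.heap/direct are the standard dCache property names, but which of the two quoted numbers maps to heap vs. direct memory is an assumption.

```
# Hypothetical layout-file fragment raising the dCacheDomain JVM memory.
# The heap vs. direct split shown here is assumed, not taken from the site.
[dCacheDomain]
dcache.java.memory.heap=8192m
dcache.java.memory.direct=4096m
```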

        Beyond this, services are stable.  Today we will complete the transition to the last of our new N2048 switches for public 1 Gb NIC attachment.

        For the queue ANALY_AGLT2_TEST_SL6-condor, the wansinklimit and deprecate_oldmover flags are now properly set (as of 1:30 pm today).

         

      • 14:35
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site is now full of jobs and operating well

         

        Minor problems/fixes during the last two weeks

        • Site was running nearly all SCORE HIMEM jobs even though many MCORE jobs were activated
          • Because we had run so many MCORE jobs for so long (and no SCORE), the fair-share weighting was skewed
          • Changed the condor knob "PRIORITY_HALFLIFE" from half a day to 5 minutes so priorities balance out faster
        • The three MWT2 Squids were heavily loaded by Frontier requests
          • This impacted access to the CVMFS repositories, causing slow mounts and slow data access
          • Created three CVMFS-only squids on the local CVMFS Stratum-1 servers
        • NSS bug with certificates
          • Fix pushed out to all nodes on Tuesday
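        For reference, PRIORITY_HALFLIFE is set in the HTCondor negotiator's configuration and is expressed in seconds, so a change like the one described would look like:

```
# 5-minute user-priority half-life; a half-day setting would have been
# 43200 (HTCondor's default is 86400 = 1 day).
PRIORITY_HALFLIFE = 300
```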

         

        OSG 3.3.23-1 installed on all nodes

         

        USERDISK and GROUPDISK decommissioning continuing

        • Waiting on ADC to change Panda Q to use SCRATCHDISK for output by ANALY Qs
        • Reducing size of GROUPDISK and adding freed space to DATADISK

         

        Storage decommissioning has begun

        • In FY17 we are scheduled to retire over 1PB of old storage
        • The first server was retired, reducing available storage by 120 TB
        • 6 more servers to retire

         

        At UC, SciDMZ upgrade work continues.  New fiber and an Arista switch are being deployed this week.  This should improve WAN transfers currently limited by the distribution switch between uct2 and the campus SciDMZ border router.

        At UC, the CRAC2 unit compressor was replaced; the system is fully back up and running well.  The CRAC1 and CRAC3 units have been assessed by an outside vendor; we meet tomorrow to discuss additional needed repairs.

        Greg is building a hot-aisle containment system to improve cooling efficiency.

         

      • 14:40
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        New 1.5PB storage in production.

        WAN and MGHPCC networking upgrades are underway: 100G to NoX/LHCONE and multiple 40G links to the NESE fabric.  The equipment is partly here.  We are involved in a re-design of the networking on the MGHPCC floor.  Networking from the Northeastern University pods to NET2 has been improved so that we can ramp up Mass Open Cloud production.

        There were 5000 transferring jobs from HU_ATLAS_Tier2 this morning.  That is high, but it seems to be gradually going down; we're watching the situation.

        We have smooth operations and full sites otherwise. 

         

      • 14:45
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - all sites running well

        - The Langston machine room had an A/C failure, which caused a brief downtime. It is not completely resolved yet, but the head and storage nodes and a few compute nodes are back up, so we are running at reduced capacity

         

      • 14:50
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        Both sites were rebuilt for the recent security patch from Red Hat.

        UTA_SWT2

        • Updated the deprecate_oldmover flag in AGIS for all PanDA queues.  No impact observed as of yet
        • Production is running fine

         

        SWT2_CPB

        • Updated the deprecate_oldmover flag in AGIS for all PanDA queues.  We don't expect any impact, but are still watching.
        • Production running fine.
        • Still need to modify AGIS to set up the internal Xrootd redirector to support direct reads for analysis jobs.  We now have rights to modify AGIS
      • 14:55
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 16:00 16:05
      AOB 5m