US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:05
      Top of the Meeting 5m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:05 13:15
      Singularity / centos 7 deployment in the US cloud 10m
      Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

      Discussion among Rob, Xin and Wei:

It is better to decouple the CentOS 7 migration from the Singularity deployment, so that the C7 migration can happen sooner. ADC doc for the C7 migration:  

      https://twiki.cern.ch/twiki/bin/view/AtlasComputing/CentOS7Readiness

US sites have the option of a Rolling Transition or a Big Bang Transition. We will work site-by-site to help with the transition. ADC strongly suggests that the Singularity 2.4.2 RPMs be installed on C7 WNs.

      • AGLT2 is in the process, using a Rolling Transition
      • BNL is doing a Rolling Transition, plus Containerized WNs (see below), so the migration is moving forward largely unnoticed.

      On Singularity: see presentation at ADC Site Jamboree

      1. Singularity version 2.4.2, no 2.3.x
      2. Ultimately pilot 2 will invoke Singularity - compatible with US APF (and EU APF and aCT)
        • Pilot 2 is not quite ready.
      3. Incompatible with Containerized WNs (encapsulating the payload in a container)
        • Containerized WNs are not a requirement, and you are on your own to support them
        • But they are not forbidden either (good for learning and trying things out).  
      4. Will need container_type and container_options setting in AGIS / Panda Queue
      5. For HPCs, investigating methods to reduce container image size
        • single release image
        • use SquashFS instead of Ext3 - doing this for NERSC - reduces image size by a factor of 3
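As a rough illustration of item 4 above, here is a minimal sketch (all names hypothetical, not the actual pilot code) of how a wrapper could turn AGIS-style `container_type` / `container_options` queue settings into a Singularity command line:

```python
# Hypothetical sketch: building a `singularity exec` command line from
# AGIS-style PanDA queue parameters. The field names mirror the
# container_type / container_options settings mentioned above; the image
# path and payload command are placeholders, not real ATLAS paths.
import shlex


def build_singularity_cmd(queue_params, image, payload):
    """Return an argv list for running `payload` inside `image`.

    Only wraps the payload when container_type requests singularity;
    otherwise the payload runs directly on the host.
    """
    ctype = queue_params.get("container_type", "")
    if "singularity" not in ctype:
        return shlex.split(payload)
    opts = shlex.split(queue_params.get("container_options", ""))
    return ["singularity", "exec"] + opts + [image] + shlex.split(payload)


# Example (placeholder values):
cmd = build_singularity_cmd(
    {"container_type": "singularity:wrapper",
     "container_options": "--bind /cvmfs --contain"},
    "/images/x86_64-centos7.img",
    "python pilot.py",
)
```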
    • 13:15 13:20
      ADC news and issues 5m
      Speakers: Robert Ball (University of Michigan (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • Updated doc on migration to CentOS7
        • https://twiki.cern.ch/twiki/bin/view/AtlasComputing/CentOS7Readiness
      • singularity 2.4.2 is now the baseline version
        • older versions can't run the latest images, which are created with Singularity 2.4
    • 13:20 13:25
      OSG software issues 5m
      Speaker: Brian Lin (University of Wisconsin)

      We're trying to track down deprecated OSG environment variables (https://jira.opensciencegrid.org/browse/SOFTWARE-3011). The following don't appear to be used by any pilots:

      • OSG_DATA
      • OSG_DEFAULT_SE
      • OSG_GLEXEC_LOCATION
      • OSG_HOSTNAME
      • OSG_LOCATION
      • OSG_STORAGE_ELEMENT

      So we would like to remove them in OSG 3.4 or, at the very least, announce their deprecation.
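As a quick local check (a sketch, not an official OSG tool), a site admin could verify whether any of these variables are still set in a job environment before they are removed:

```python
# Sketch: report which of the OSG variables slated for removal are still
# set in an environment. The variable list is copied from the
# SOFTWARE-3011 discussion above.
import os

DEPRECATED_OSG_VARS = [
    "OSG_DATA",
    "OSG_DEFAULT_SE",
    "OSG_GLEXEC_LOCATION",
    "OSG_HOSTNAME",
    "OSG_LOCATION",
    "OSG_STORAGE_ELEMENT",
]


def still_set(environ=os.environ):
    """Return the deprecated variables present in `environ`, sorted."""
    return sorted(v for v in DEPRECATED_OSG_VARS if v in environ)


still_set({"OSG_DATA": "/x", "PATH": "/bin"})  # → ["OSG_DATA"]
```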

    • 13:25 13:30
      Production 5m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:30 13:35
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:35 13:40
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:40 13:45
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      Last week I attended two networking meetings: LHCONE/LHCOPN in Abingdon (https://indico.cern.ch/event/681168/) and the perfSONAR annual developer meeting in Amsterdam (no public link). Lots of good discussion at both. LHCONE/LHCOPN meeting report at https://indico.cern.ch/event/681168/attachments/1616425/2569199/LHCOPNE-20180307-Abingdon-meeting-report.pdf

      Today was the second HEPiX NFV WG meeting (https://indico.cern.ch/event/705126/). The next meeting is April 25 at 10 AM Eastern. Live notes at https://docs.google.com/document/d/1CTsAqioZY8pcCDf3S7GbObHD_Sic06BF15dPmaVjOcM/edit

      Questions on these meetings?

      I won't go into other networking details here unless there are questions. Next week at the OSG AHM meeting there are four talks on networking:

      USATLAS meeting:  Network evolution  (Shawn)

      Joint USATLAS/FIFE/USCMS meeting:   perfSONAR discussion (Shawn)

      Tuesday afternoon: OSG Networking Analytics: Evolution and Status (Shawn / Ilija)

      Wednesday afternoon:  OSG Networking (Shawn)

      If you have questions (or specific things you think need covering in any of the above), bring them up now or email me.

    • 13:45 13:50
      XCache 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 13:50 13:55
      HPCs integration 5m
      Speaker: Taylor Childers (Argonne National Laboratory (US))

      Harvester deployment:

      ALCF: The locally installed Rucio version was outdated enough to cause issues. Had to reinstall Harvester to get things consistent again. Back online, but needs work. Discussing with Doug G whether we should continue with dedicated tasks or grid-style running; each comes with its own benefits/drawbacks.

      NERSC: Harvester up and running on Cori-P1/P2, processed 50M+ events over the past 7 days.

      OLCF: Harvester now running for the Allocation jobs queue. Running 3 batch jobs at a time with 800 nodes each. 

      Container deployment:

      NERSC: done

      OLCF/ALCF: still in development. 

    • 13:55 14:30
      Site Reports
      • 13:55
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • running fine in general
        • the new XRootD version (a release candidate) was put in place last week; it fixed an issue that caused XRootD to crash with core dumps. The official release will come later. 
        • the new WNs have been in production for several weeks. The migration of the rest of the farm to SL7 will be combined with the installation of new top-of-rack switches, to minimize downtime.  
      • 14:00
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        Four C6420 chassis and sleds are being racked today.  We will configure them as SL7 WN as we get them up and ready to go.

        All WN at MSU are now running SL7 as are 1/3 of the WN at UM.  We are developing a plan to move the balance of the UM WN to SL7 by the end of March.

        As of today, all of our dCache servers are dual IPv4/IPv6 stacked.  We have not yet registered AAAA records though.

        We have a network interruption between UM and MSU on Thursday night that will adversely impact HTCondor communications between our sites for up to 4 hours. Consequently, we will idle down all MSU WN starting later this afternoon so as to lose as few jobs as possible during the outage.

        On Friday after the MSU WN set is back online, we will add SL7 Analysis and LMEM queues, rounding out the SL7 Panda Queue complement for AGLT2.  When the complement of SL6 WN drops below some threshold, we will delete the SL6 Panda Queues and become SL7-only.

        We are coordinating with the OSG folks on moving our non-ATLAS gate-keeper to SL7.  This will most likely happen some time next week.

        Singularity is installed on all WN as they are built, but no special configuration considerations have been implemented.  Versions:

        singularity-2.4.2-1.osg34.el7.x86_64
        singularity-runtime-2.4.2-1.osg34.el7.x86_64
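For reference, a small sketch (not an official tool) for checking that an installed version string, e.g. from `singularity --version` or the RPM names above, meets the 2.4.2 baseline ADC now requires:

```python
# Sketch: parse a Singularity version string and compare it against the
# 2.4.2 baseline. Handles strings like "2.4.2-1.osg34.el7" or full rpm
# names such as "singularity-2.4.2-1.osg34.el7.x86_64".
import re

BASELINE = (2, 4, 2)


def parse_version(text):
    """Extract (major, minor, patch) from the first x.y.z in `text`."""
    m = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if not m:
        raise ValueError("no version found in %r" % text)
    return tuple(int(g) for g in m.groups())


def meets_baseline(text, baseline=BASELINE):
    """True when the version in `text` is >= the baseline tuple."""
    return parse_version(text) >= baseline


meets_baseline("singularity-2.4.2-1.osg34.el7.x86_64")  # → True
meets_baseline("2.3.1")                                 # → False
```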

         

      • 14:05
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))

        Overall, site is performing well and is full of jobs

        Singularity upgraded to 2.4.2 on all workers

        UC

        • four of the twenty new C6420s online and running jobs
        • remaining sixteen are built but still offline
          • spec results are low (less than 50% of what they are expected to be)
          • BIOS settings are consistent, appear to be correct
          • has anyone else had this issue with this latest batch of workers?

        IU

        • still waiting on power
        • work order is in, but timeframe is unknown

        UIUC

        • nothing new to report
      • 14:10
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Just about ready to start "NET3", a joint Tier 3 with BU, Harvard and UMASS/Amherst.  

        Progress on the HTCondor migration with Brian Lin's help. Harvard has upgraded to OSG 3.4 with the new HTCondor. The problem has so far not reappeared. Setting up to do the same on the BU side.

        Working on the LCMAPS and BeStMan migration (we're not worried about usatlas1,2,3,4 since they are all in the same unix group and group permissions are enough to do everything). We're planning to use Wei's gridftp-posix with a callout for Adler32 checksum computation.
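For illustration, the Adler32 computation such a checksum callout performs can be done with Python's standard `zlib` module (a sketch, not Wei's actual callout):

```python
# Sketch of an Adler32 file checksum, reading in chunks so large files
# don't need to fit in memory. ATLAS/Rucio conventionally report the
# digest as 8 lowercase hex digits. The file path is a placeholder.
import zlib


def adler32_of_file(path, chunk_size=1 << 20):
    """Return the Adler32 checksum of a file as an 8-digit hex string."""
    value = 1  # Adler32 initial value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return "%08x" % (value & 0xFFFFFFFF)
```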

        Working on GPFS migration so that the system pool is on warrantied equipment.

        Preparing for a starter NESE data lake deployment: ~12 PB raw, including substantial buy-in from Harvard.  

        Reminder: We're planning to migrate the NET2 storage endpoint into NESE.

        Added Fermilab access for OSG jobs.

        Sites consistently full with smooth operations.  

        Hoping for ESnet help restarting our LHCONE peering.  

        SL7 transition is on the agenda. 

         

         

      • 14:15
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - all OU sites working well

        - still working on getting Rucio to use READ_LAN and WRITE_LAN, in order to stage in/out from internal xrootd directly. Working with Mario and Alexey on that

        - Lucille is ready to be migrated from Lucille_SE to OU_OSCER_ATLAS_SE

        - taking brief OSCER downtime this afternoon for RAM replacement and BIOS updates

         

      • 14:20
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        SWT2_CPB

        • Seemingly solved an issue related to XRootD checksumming that was causing many problems.
        • Major power outage when the utility feed burned up and the building generator failed. Both have been repaired.
        • Delayed working on HTCondor while dealing with the above

        UTA_SWT2

        • Updated firmware in Dell 4032 stack to avoid issues with lockup
        • Power outage at SWT2_CPB affected the network path for this cluster.
      • 14:25
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:30 14:35
      AOB 5m