US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:05
      Top of the Meeting 5m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:05 13:10
      ADC news and issues 5m
      Speakers: Robert Ball (University of Michigan (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • Squad 5m
        Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:10 13:20
      OSG news 10m
      Speakers: Brian Lin (University of Wisconsin), Brian Paul Bockelman (University of Nebraska Lincoln (US))

      https://opensciencegrid.github.io/technology/policy/service-migrations-spring-2018/

      • Mitigations 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
      • OSG Services Migration 20m
        Speaker: Brian Paul Bockelman (University of Nebraska Lincoln (US))
    • 13:20 13:25
      Production 5m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:25 13:30
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:30 13:35
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:35 13:40
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      OSG network services are migrating to AGLT2_MSU's VMware instance.  

      • New VMs (psetf, psrsv and psconfig, and their ITB versions) have already been created.
      • The plan is to test the new VMs to ensure they are correct, do a final service data migration, and then cut over the DNS IP addresses for *.opensciencegrid.org to put them into production.
      • OSG ESmond (central MA) and its associated VMs (7 of them) will be shut down.

      The HEPiX Network Function Virtualization working group met today:  https://indico.cern.ch/event/715631/

      • A recording will be posted soon for those who missed it.

    • 13:40 13:45
      XCache 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

      Technical details discussed at the regular XCache Monday meeting. 

      Wei and Xin got copies of the xcache robot cert. Wei and Shawn will try Lincoln's instructions to set up their own k8s clusters as soon as they get hardware in place. Ilija will test how things get deployed at the Utah cluster. Helm deployment development will start with the multinode xCache cluster (so not yet). We will have to discuss with Andy how to organize the cluster and its storage.

      The effects of pCache were presented at the TCB meeting. An ARC site claims a ~80% cache hit rate from a 250 TB LRU cache!

      Developing code to simulate different caching configurations.
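
      As an illustration of what such a simulation can look like, here is a minimal sketch that replays an access trace through a byte-limited LRU cache and reports the hit rate. It is not the code under development: the function name `simulate_lru` and the trace format (a list of (file name, size) pairs) are assumptions made for this example.

```python
from collections import OrderedDict


def simulate_lru(accesses, capacity_bytes):
    """Replay (name, size) accesses through an LRU cache and return the hit rate.

    `accesses` is a hypothetical trace format: an iterable of (name, size) pairs.
    """
    cache = OrderedDict()  # name -> size, least recently used first
    used = 0               # bytes currently held in the cache
    hits = 0
    total = 0
    for name, size in accesses:
        total += 1
        if name in cache:
            hits += 1
            cache.move_to_end(name)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # file can never fit, so it is served without caching
        # Evict least recently used entries until the new file fits.
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[name] = size
        used += size
    return hits / total if total else 0.0


if __name__ == "__main__":
    trace = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1)]
    for capacity in (2, 3):
        print(capacity, simulate_lru(trace, capacity))
```

      Running the same trace at several capacities, as in the loop at the bottom, is one simple way to compare caching configurations; other eviction policies would need their own replay function.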

       

    • 13:45 13:50
      HPCs integration 5m
      Speaker: Taylor Childers (Argonne National Laboratory (US))

      NERSC using Harvester + Minipilot. Working smoothly. Allocation completely exhausted this past week. Containers being used for software distribution.

      ALCF using Harvester + Yoda. Lots of debug work ongoing related to PanDA settings, JumboJobs, Athena performance improvements, etc. Aiming for mid-May to have Yoda tested, validated and ready for production jobs. Singularity containers now being used for software distribution.

      OLCF testing Harvester + Minipilot, but not yet in production. Aiming for mid-May to have Harvester online. No ETA for containers in production, but no obvious hang-ups to deploying them at this point.

    • 13:50 14:25
      Site Reports
      • 13:50
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • Increase of LOCALGROUPDISK space to 1 PB, to help with decommissioning SEs at small sites.
        • With the new TOR switches in place, migration of the rest of the farm to SL7 started this week in a rolling fashion; some old WNs will be removed as well.  The whole process is expected to finish by late May.
        • OSG migration
          • working with ITD on joining InCommon CA
          • working with GGUS on interfacing GGUS with BNL RT directly. 
      • 13:55
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        All UM C6420 sleds are now powered up.  All but 4 are in HTCondor production; the last 4 are ready now that power on the PDUs has been balanced out.  All give consistent HS06 results.

        MSU sleds still awaiting switch reconfiguration.

        All operations are running normally.

      • 14:00
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))

        Overall, the site is performing well and is full of jobs.

        • UC
          • Elasticsearch: new ES hardware now online
        • IU
          • C6420 deployment: working on getting the first C6420 built
        • UIUC
          • no new updates
      • 14:05
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

        Our OSG 3.4 / LCMAPS / HTCondor upgrade is done.  

        Our HTCondor problems have not appeared at Harvard or BU since the upgrade... so far.  Fingers crossed.

        We're ready to migrate away from GRAM.  Coordinating with Jose and John Hover. 

        Installed Wei's version of GridFTP with a callout to our Adler32 code.  Works fine.
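
        For readers unfamiliar with the checksum involved, here is a minimal sketch of computing a file's Adler-32 value with Python's standard `zlib` module. It is not the NET2 callout code: the function name `adler32_of_file`, the chunk size, and the 8-hex-digit output format are assumptions made for this example.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits.

    The file is read in chunks so large files do not have to fit in memory.
    """
    value = 1  # Adler-32 starts from 1, not 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in to continue the checksum.
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")
```

        Reading in chunks keeps memory use flat, but the whole file still has to be read, so checksum time grows with file size.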

        Next up is migrating away from BeStMan.  

        LHCONE peering resumed successfully after replacing a bad card at MANLAN.  

        NET3 "Northeast Tier 3" slowly growing.  UMASS/Amherst buy-in ordered. 

        Strangely, we had a very high deletion rate for 2-3 days, causing SRM stress and a couple of trouble tickets.  We made some adjustments, and then the deletion rate also mysteriously went down by a factor of 5 or so.  

        There is lots of NESE activity.  10 PB raw deployment ordered.

        SL7 migration coming soon.

      • 14:10
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - Currently in OSCER scheduled maintenance till this evening

        - Lucille cooling failure, running with reduced capacity

        - Had OU network problem last week, fixed

        - Experienced 50% DDM transfer failures earlier this week, which were tracked down to checksum timeouts. Extending that timeout in our GridFTP server fixed the failures. Wei knows details.

      • 14:15
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        UTA_SWT2

        • Will try to update production CE to HTCondor

        SWT2_CPB

        • Still scaling HTCondor; identified an issue with SAM tests.
        • The XRootD issue with long pathnames still exists and has not been reproduced by Wei; will investigate further.
      • 14:20
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:25 14:30
      AOB 5m
    • 14:30 14:40
      Singularity / CentOS 7 deployment in the US cloud 10m
      Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))