US ATLAS Computing Facility

US/Eastern
Videoconference Rooms
US_ATLAS_Computing_Integration_and_Operations
Name
US_ATLAS_Computing_Integration_and_Operations
Description
Bi-weekly Facilities meeting
Extension
109263008
Owner
Robert William Gardner Jr
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci

      OSG 3.4.32 (tomorrow?)

      • XRootD 4.10.0 (and plugins)
      • Frontier Squid 4.4-2.1
      • HTCondor 8.8.4 (in osg-upcoming)
      • Singularity 3.2.1-1.1?

      OSG 3.4.33

      • ATLAS XCache RPM
      • Slurm v19 support and bug fixes for the gratia-probe
    • 13:20 14:00
      Topical Report
      • 13:20
        SLATE and Security 15m
        Speaker: Christopher Weaver (University of Chicago)

        Begin forwarded message:

         

        From: Romain Wartel <Romain.Wartel@cern.ch>

        Subject: Introduction call -- DRAFT Meeting notes

        Date: July 16, 2019 at 9:22:37 AM CDT

        To: "wlcg-security-SLATE-wg (WLCG SLATE WG)" <wlcg-security-SLATE-wg@cern.ch>

        Resent-From: <rwg@uchicago.edu>

         

        DRAFT Meeting notes of today’s call. All corrections, additions very welcome!

        Cheers,

        Romain.

        ***************************
        * Meeting participants:
        ***************************

        - Chris Weaver
        - Dave Kelsey
        - Frank Wuerthwein
        - Igor Sfiligoi
        - Jim Basney
        - Johannes Elmsheuser
        - Lincoln Bryant
        - Nikolai Hartmann
        - Paul Millar
        - Robert Gardner
        - Romain Wartel
        - Petr Vokac
        - Tom Barton
        - Stephane Jezequel
        - Shawn McKee
        - Vincent Brillault
        - Brian Bockelman
        - Xavier Espinal
        - Elizabeth Sexton-Kennedy


        ***************************
        * Agenda
        ***************************

        - Romain explained that several additional parties (including OSG, Fermilab, ESnet) are interested in participating in this working group but have not yet had time to appoint someone.

        - Introduction by Chris Weaver (slides on Indico)
        Q&A:
        Vincent: The logs are stored locally with Systemd. Maybe these logs should be forwarded to a central remote system for added security?
        Chris: Agree.

        Vincent: The portal stores tokens giving significant access to the infrastructure. How’s the security managed on the portal and for token management?
        Chris: There are different components. Most of the SLATE development team has access to the different systems. Only the “current” tokens are in memory, and the main database is Amazon DynamoDB, which is accessible only to a couple of people in the SLATE development team.

        Vincent: Scanning the container is very nice. Have you also considered scanning for insecure configurations (e.g. exposing unprotected SMB on the Internet)?
        Chris: The current scans are in any case quite limited. That said, the number of new SLATE applications is quite low and the review is largely done manually for the moment.
        Searching for configuration issues is a good idea. This is not done yet, as the tools are currently quite limited.

        - General discussion:

        Romain: What should this group try to achieve? What are the security aspects that need to be covered by this WG?
        Rob: The resource providers would probably like to see an effort to increase the trust in the SLATE service, security model, and overall operational/deployment strategy. SLATE is a significant cultural shift in the way services are operated across a distributed infrastructure.
        A WLCG-wide discussion is needed to address the different (security) challenges to overcome, in order to improve service adoption and gain additional capabilities.
        Igor: As a site admin, I worry about the permissions I need to give to the SLATE developers to make it work. How can I be sure SLATE will not compromise my Kubernetes system?
        Rob: There is documentation available that should address questions around deployment impact, permissions needed, etc.
        Romain: The security team would probably also like to see the basics covered (image security, security updates, incident response, etc.)

        Romain: Do we need to also discuss security policies and trust framework as part of this working group?
        Dave: Yes, we need to review the risks and understand more details around SLATE. Depending on the findings we may or may not need new security policies.
        Romain: Do we need a security review?
        Dave: Probably. This would enable everybody to understand in more detail how SLATE works and the implications for the resource providers.
        Tom: Regarding the security review, Jim Basney and TrustedCI have already started some security review work.
        In addition, preparing some kind of “declarations” would help bring additional transparency, which would be very welcome to improve trust among the different parties involved.
        Jim: TrustedCI would like to engage with this WG and ideally share tasks. TrustedCI is looking at the security policy aspects around the SLATE infrastructure, as well as image security scanning tools. It would be helpful to understand how to map this with WLCG policy and operational security practices.
        Romain: Direct, close cooperation between TrustedCI and this WG is absolutely crucial.

        Romain: We should maybe also explore the implications on incident response of the role/responsibility changes implied by the shift in deployment model.
        We will address the trust framework, security policies, security architecture, operational security aspects (vulnerability management, incident response, etc.). Any other obvious area to explore?
        Rob: More topics will probably come up in the near future!

        ***************************
        * Next meetings
        ***************************

        - Co-locate a side meeting around the September GDB at Fermilab (coordination: Rob, Fermilab)
        https://indico.cern.ch/event/739882/

        - Co-locate a side meeting around the October NSF Security Summit / WISE meeting (coordination: Tom, Dave, Romain)
        https://trustedci.org/2019-nsf-cybersecurity-summit
         

    • 13:40 14:25
      US Cloud Status
      • 13:40
        US Cloud Operations Summary 5m
        Speaker: Mark Sosebee (University of Texas at Arlington (US))
      • 13:45
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • smooth operation in general
        • problem with migration to pilot2+container 
          • farm drained after the migration. Pilots failed and exited before picking up real jobs because of a version mismatch between pilot2 and the wrapper, which arose because Harvester uses a special template for BNL. The template issue needs to be taken care of first. Backed off to pilot1 for now.
        • several racks of the T1 computing farm are draining now for scheduled electrical work in the data center late this week and early next week.
        • continue work on integration of GPU@BNL IC with PanDA
      • 13:50
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)


        One ticket #142370 on 22-Jul-2019 for transfer errors on timeout. But the logs seem to show the transfer did happen. The failure rate is now back to normal.

        The June WLCG report showed AGLT2 with a low (88%) availability and reliability.  We realized one of our gatekeepers (gate03) had stopped accepting test jobs.  We are trying to get the numbers amended.

        Hardware:

        General maintenance: replacing failed dCache storage disks and worker node memory.
        MSU is finishing its recovery from the July 1st CRAC shutdown caused by a cottonwood problem.


         

      • 13:55
        MWT2 5m
        Speakers: Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))

        Site problems following the migration to pilot2+containers starting 19 July

        • Size of pilot logs generated was filling the root disks on our three gatekeepers
        • Fixed in v 2.1.17 released yesterday

        UIUC worker issues following ICC PM

        • ICC quarterly PM was scheduled for 17 July
        • Kernel panics post upgrade to kernel 3.10.0-957.21.3 and GPFS client 5.0.2-3
        • ICC admins downgraded our worker image to kernel 3.10.0-957 until a fix is in place

        Network reconfiguration at IU

        • Consolidated our old 149.165.224.0/24 and 149.165.225.0/24 into a single /23
        • Moved our management interfaces off the SciDMZ to a private subnet
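
        As a rough illustration of the subnet consolidation above, the sketch below uses Python's standard ipaddress module to check that two /24 prefixes merge into one /23. It assumes the two old prefixes are the adjacent blocks 149.165.224.0/24 and 149.165.225.0/24; the variable names are illustrative only and do not reflect the site's actual tooling.

```python
import ipaddress

# Assumption for illustration: the two old subnets are adjacent /24 blocks.
old_subnets = [
    ipaddress.ip_network("149.165.224.0/24"),
    ipaddress.ip_network("149.165.225.0/24"),
]

# collapse_addresses() merges adjacent or overlapping networks
# into the smallest equivalent set of prefixes.
merged = list(ipaddress.collapse_addresses(old_subnets))

# Two /24s merge only if they sit in the same aligned /23.
for net in merged:
    print(net, "with", net.num_addresses, "addresses")
```

        Two /24 blocks collapse into a single /23 only when they are adjacent and the lower one starts on an even third octet, which holds for 224 and 225.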

        IPv6 status

        • Working with ITS to get v6 PTR records working
        • Will try again tomorrow to get our storage dual stacked
      • 14:00
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))

         

        Smooth operations except for a problem today caused by some Tier 3 user jobs.

        Setting up 4 nodes to be new squids and/or gridftp endpoints.

        Preparing NESE gateways to be ATLAS DDM endpoints.  

      • 14:05
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        1) First part of our recent purchase (storage + compute nodes) starting to get delivered.

        2) Migration of UTA_SWT2 to CentOS7 almost done. (Had deferred this update while optimizing slurm configuration at SWT2_CPB). Expect to complete this later today (7/24).

        3) Two tickets over the past two weeks (deletion errors at UTA_SWT2 - bad drive in a RAID set; our local tier-3 needed an update to its frontier/squid settings). Both resolved.

         

        OU:

        - Overall things running well.

        - Had a bad RAM DIMM in one xrootd storage server, which caused some deletion and job failures. It was replaced this morning, along with the motherboard, which also seemed to have issues. Everything should be back to normal now.

        - Getting quotes for compute nodes for remaining hardware funds.

         

      • 14:10
        HPC Operations 5m
        Speaker: Doug Benjamin (Duke University (US))
      • 14:15
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: William Strecker-Kellogg (Brookhaven National Lab)
    • 14:25 14:30
      AOB 5m