US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2019-07-24T13:00:00-04:00
End: 2019-07-24T15:00:00-04:00
Location: No location set

Wednesday 24 Jul 2019, 13:00 → 15:00 US/Eastern

- 13:00 → 13:10
  
  WBS 2.3 Facility Management News 10m
  
  Minutes
  
  Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
  
  New signup sheet for topical presentations, https://docs.google.com/document/d/1NIc67p3AB2RkYjJsP6Nx_lwPXFX03w1n2SFOgCU47ro/edit?usp=sharing
- 13:10 → 13:20
  OSG-LHC 10m
  
  Minutes
  
  Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
  OSG 3.4.32 (tomorrow?)
  
  XRootD 4.10.0 (and plugins)
  
  Frontier Squid 4.4-2.1
  
  HTCondor 8.8.4 (in osg-upcoming)
  
  Singularity 3.2.1-1.1?
  
  OSG 3.4.33
  
  ATLAS XCache RPM
  
  Slurm v19 support and bug fixes for the gratia-probe
- 13:20 → 14:00
  Topical Report
  - 13:20
    
    SLATE and Security 15m
    
    Minutes
    
    Speaker: Christopher Weaver (University of Chicago)
    
    Slate security ATLAS call 20190724.pdf
    
    Begin forwarded message:
    
    From: Romain Wartel <Romain.Wartel@cern.ch>
    
    Subject: Introduction call -- DRAFT Meeting notes
    
    Date: July 16, 2019 at 9:22:37 AM CDT
    
    To: "wlcg-security-SLATE-wg (WLCG SLATE WG)" <wlcg-security-SLATE-wg@cern.ch>
    
    Resent-From: <rwg@uchicago.edu>
    
    DRAFT Meeting notes of today’s call. All corrections, additions very welcome!
    
    Cheers,
    
    Romain.
    —
    ***************************
    * Meeting participants:
    ***************************
    
    - Chris Weaver
    - Dave Kelsey
    - Frank Wuerthwein
    - Igor Sfiligoi
    - Jim Basney
    - Johannes Elmsheuser
    - Lincoln Bryant
    - Nikolai Hartmann
    - Paul Millar
    - Robert Gardner
    - Romain Wartel
    - Petr Vokac
    - Tom Barton
    - Stephane Jezequel
    - Shawn McKee
    - Vincent Brillault
    - Brian Bockelman
    - Xavier Espinal
    - Elizabeth Sexton-Kennedy
    
    ***************************
    * Agenda
    ***************************
    
    - Romain explained that several additional parties (including OSG, Fermilab, ESnet) are interested in participating in this working group but have not yet had time to appoint someone.
    
    - Introduction by Chris Weaver (slides on Indico)
    Q&A:
    Vincent: The logs are stored locally with Systemd. Maybe these logs should be forwarded to a central remote system for added security?
    Chris: Agree.
    
    Vincent: The portal stores tokens giving significant access to the infrastructure. How’s the security managed on the portal and for token management?
    Chris: There are different components. Most of the SLATE development team has access to the different systems. Only the “current” tokens are in memory, and the main database is Amazon DynamoDB, which is accessible only by a couple of people in the SLATE development team.
    
    Vincent: Scanning the container is very nice. Have you also considered scanning for insecure configurations (e.g. exposing unprotected SMB on the Internet)?
    Chris: The current scans are in any case quite limited. That said, the number of new SLATE applications is quite low and the review is largely done manually for the moment.
    Searching for configuration issues is a good idea. This is not done yet as the tools are currently quite limited.
    
    - General discussion:
    
    Romain: What should this group try to achieve? What are the security aspects that need to be covered by this WG?
    Rob: The resource providers would probably like to see an effort to increase the trust in the SLATE service, security model, and overall operational/deployment strategy. SLATE is a significant cultural shift in the way services are operated across a distributed infrastructure.
    A WLCG-wide discussion is needed to address the different (security) challenges to overcome, in order to improve service adoption and gain additional capabilities.
    Igor: As a site admin, I worry about the permissions I need to give to the SLATE developers to make it work. How can I be sure SLATE will not compromise my Kubernetes system?
    Rob: There is documentation available that should address questions around deployment impact, permissions needed, etc.
    Romain: The security team would probably also like to see the basics covered (image security, security updates, incident response, etc.).
    
    Romain: Do we need to also discuss security policies and trust framework as part of this working group?
    Dave: Yes, we need to review the risks and understand more details around SLATE. Depending on the findings we may or may not need new security policies.
    Romain: Do we need a security review?
    Dave: Probably. This would enable everybody to understand in more detail how SLATE works and the implications for the resource providers.
    Tom: Regarding the security review, Jim Basney and TrustedCI have already started some security review work.
    In addition, preparing some kind of “declarations” would help bring additional transparency, which would be very welcome to improve trust among the different parties involved.
    Jim: TrustedCI would like to engage with this WG and ideally share tasks. TrustedCI is looking at the security policy aspects around the SLATE infrastructure, as well as image security scanning tools. It would be helpful to understand how to map this with WLCG policy and operational security practices.
    Romain: Direct, close cooperation between TrustedCI and this WG is absolutely crucial.
    
    Romain: We should maybe also explore the implications on incident response of the role/responsibility changes implied by the shift in deployment model.
    We will address the trust framework, security policies, security architecture, operational security aspects (vulnerability management, incident response, etc.). Any other obvious area to explore?
    Rob: More topics will probably come up in the near future!
    
    ***************************
    * Next meetings
    ***************************
    
    - Co-locate a side meeting around the September GDB at Fermilab (coordination: Rob, Fermilab)
    https://indico.cern.ch/event/739882/
    
    - Co-locate a side meeting around the October NSF Security Summit / WISE meeting (coordination: Tom, Dave, Romain)
    https://trustedci.org/2019-nsf-cybersecurity-summit
- 13:40 → 14:25
  US Cloud Status
  - 13:40
    
    US Cloud Operations Summary 5m
    
    Speaker: Mark Sosebee (University of Texas at Arlington (US))
    
    US-cloud-summary-7_17_19.pdf
    
    US-cloud-summary-7_24_19.pdf
  - 13:45
    BNL 5m
    
    Minutes
    
    Speaker: Xin Zhao (Brookhaven National Laboratory (US))
    
    smooth operation in general
    
    problem with migration to pilot2+container
    
    farm drained after the migration. Pilots failed and exited before picking up real jobs, due to a mismatch between the pilot2 and wrapper versions, which arose because Harvester uses a special template for BNL. The template issue needs to be taken care of first. Backed off to pilot1 for now.
    
    several racks of the T1 computing farm are draining now for scheduled electrical work in the data center late this week and early next week.
    
    continue work on integration of GPU@BNL IC with PanDA
  - 13:50
    
    AGLT2 5m
    
    Minutes
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
    
    One ticket #142370 on 22-Jul-2019 for transfer errors on timeout. But the logs seem to show the transfer did happen. The failure rate is now back to normal.
    
    The June WLCG report showed AGLT2 with a low (88%) availability and reliability. We realized one of our gatekeepers (gate03) had stopped accepting test jobs. We are trying to get the numbers amended.
    
    Hardware:
    
    General maintenance: replacing failed dCache storage disks and worker node memory.
    MSU is finishing its recovery from the July 1st CRAC shutdown caused by the cottonwood problem.
  - 13:55
    MWT2 5m
    
    Minutes
    
    Speakers: Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))
    
    Site problems following the migration to pilot2+containers starting 19 July
    
    Size of pilot logs generated was filling the root disks on our three gatekeepers
    
    Fixed in v2.1.17, released yesterday
    
    UIUC worker issues following ICC PM
    
    ICC quarterly PM was scheduled for 17 July
    
    Kernel panics post upgrade to kernel 3.10.0-957.21.3 and GPFS client 5.0.2-3
    
    ICC admins downgraded our worker image to use kernel 3.10.0-957 until a fix is in place
    
    Network reconfiguration at IU
    
    Consolidated our old 149.165.224.0/24 and 149.165.225.0/24 into a single /23
    
    Moved our management interfaces off the SciDMZ to a private subnet
    
    IPv6 status
    
    Working with ITS to get v6 PTR records working
    
    Will try again tomorrow to get our storage dual stacked
  - 14:00
    
    NET2 5m
    
    Minutes
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    Smooth operations except for a problem today caused by some Tier 3 user jobs.
    
    Setting up 4 nodes to be new squids and/or gridftp endpoints.
    
    Preparing NESE gateways to be ATLAS DDM endpoints.
  - 14:05
    
    SWT2 5m
    
    Minutes
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    UTA:
    
    1) First part of our recent purchase (storage + compute nodes) starting to get delivered.
    
    2) Migration of UTA_SWT2 to CentOS7 almost done. (Had deferred this update while optimizing slurm configuration at SWT2_CPB). Expect to complete this later today (7/24).
    
    3) Two tickets over the past two weeks (deletion errors at UTA_SWT2 - bad drive in a RAID set; our local tier-3 needed an update to its frontier/squid settings). Both resolved.
    
    OU:
    
    - Overall things running well.
    
    - Had a bad RAM DIMM in one xrootd storage server, which caused some deletion and job failures. It was replaced this morning (as well as the motherboard, which seemed to have issues as well). Everything should be back to normal now.
    
    - Getting quotes for compute nodes for remaining hardware funds.
  - 14:10
    
    HPC Operations 5m
    
    Speaker: Doug Benjamin (Duke University (US))
  - 14:15
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:20
    
    Analysis Facilities - BNL 5m
    
    Speaker: William Strecker-Kellogg (Brookhaven National Lab)
- 14:25 → 14:30
  
  AOB 5m
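
Note on the IU network consolidation reported under MWT2: two /24 blocks merge into a single /23 only when they are adjacent and aligned on a /23 boundary, as 149.165.224.0/24 and 149.165.225.0/24 are. The following is a minimal sketch of that check using Python's standard `ipaddress` module. The helper name `consolidate` is hypothetical and is not part of any site tooling; the non-adjacent pair in the second call is included only for contrast.

```python
import ipaddress


def consolidate(cidrs):
    """Merge CIDR blocks into the smallest equivalent list of networks."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    # collapse_addresses merges adjacent or overlapping networks where possible
    return [str(n) for n in ipaddress.collapse_addresses(nets)]


# Adjacent, aligned /24s collapse into one /23.
merged = consolidate(["149.165.224.0/24", "149.165.225.0/24"])
print(merged)

# Non-adjacent /24s cannot be merged and are returned unchanged.
not_merged = consolidate(["149.165.224.0/24", "192.165.225.0/24"])
print(not_merged)
```

Running a check like this before changing routing or firewall rules is one way to confirm that the supernet covers exactly the old ranges and nothing more.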
