US ATLAS Computing Integration and Operations
-
13:00
→
13:05
Top of the Meeting (5m). Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
-
13:05
→
13:15
Singularity / CentOS 7 deployment in the US cloud (10m). Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
Discussion among Rob, Xin and Wei:
It is better to decouple the CentOS 7 migration from the Singularity deployment, so that the C7 migration can happen sooner. ADC doc for C7 migration:
https://twiki.cern.ch/twiki/bin/view/AtlasComputing/CentOS7Readiness
US sites have the option of a Rolling Transition or a Big Bang Transition. We will work site-by-site to help with the transition. ADC strongly suggests that the Singularity 2.4.2 RPMs be installed on C7 WNs.
- AGLT2 is in the process, using Rolling Transition
- BNL is doing a Rolling Transition, plus Containerized WNs (see below), so the migration is moving forward largely unnoticed.
On Singularity: see presentation at ADC Site Jamboree:
- Singularity version 2.4.2, not 2.3.x
- Ultimately pilot 2 will invoke Singularity - compatible with US APF (and EU APF and aCT)
- Pilot 2 is not quite ready.
- Incompatible with Containerized WNs (encapsulating the payload in a container)
- Containerized WNs are not a requirement, and you are on your own to support them
- But they are not forbidden either (good for learning and trying things out).
- Will need container_type and container_options settings in AGIS / PanDA Queue
- For HPCs, investigating methods to reduce container image size
- single release image
- use SquashFS instead of Ext3 - doing this for NERSC - reduces image size by a factor of 3
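The AGIS / PanDA queue settings mentioned above might look like the fragment below. The field names container_type and container_options come from the discussion; the values shown (a Singularity wrapper mode and a CVMFS bind mount) are illustrative assumptions, not the prescribed configuration:

```json
{
  "container_type": "singularity:wrapper",
  "container_options": "-B /cvmfs --contain"
}
```

Sites should confirm the exact values with ADC before setting them in AGIS.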
-
13:15
→
13:20
ADC news and issues (5m). Speakers: Robert Ball (University of Michigan (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
13:20
→
13:25
OSG software issues (5m). Speaker: Brian Lin (University of Wisconsin)
We're trying to track down deprecated OSG environment variables (https://jira.opensciencegrid.org/browse/SOFTWARE-3011). The following don't appear to be used by any pilots:
- OSG_DATA
- OSG_DEFAULT_SE
- OSG_GLEXEC_LOCATION
- OSG_HOSTNAME
- OSG_LOCATION
- OSG_STORAGE_ELEMENT
So we would like to remove them in OSG 3.4 or, at the very least, announce their deprecation.
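As a quick sanity check, a site or pilot maintainer could scan a job's environment for these variables before they are removed. A minimal sketch, assuming only the variable list from the bullets above (the function name is ours, not part of any OSG tool):

```python
import os

# Deprecated OSG attributes slated for removal (see SOFTWARE-3011)
DEPRECATED = [
    "OSG_DATA",
    "OSG_DEFAULT_SE",
    "OSG_GLEXEC_LOCATION",
    "OSG_HOSTNAME",
    "OSG_LOCATION",
    "OSG_STORAGE_ELEMENT",
]

def deprecated_osg_vars(environ=None):
    """Return, sorted, the deprecated OSG variables still set in the environment."""
    if environ is None:
        environ = os.environ
    return sorted(v for v in DEPRECATED if v in environ)

if __name__ == "__main__":
    found = deprecated_osg_vars()
    if found:
        print("Still set:", ", ".join(found))
```

Running this inside a pilot's environment would show whether any of the listed variables are still being exported there.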
-
13:25
→
13:30
Production (5m). Speaker: Mark Sosebee (University of Texas at Arlington (US))
-
13:30
→
13:35
Data Management (5m). Speaker: Armen Vartapetian (University of Texas at Arlington (US))
-
13:35
→
13:40
Data transfers (5m). Speaker: Hironori Ito (Brookhaven National Laboratory (US))
-
13:40
→
13:45
Networks (5m). Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
Last week I attended two networking meetings: LHCONE/LHCOPN in Abingdon (https://indico.cern.ch/event/681168/) and the perfSONAR annual developer meeting in Amsterdam (no public link). Lots of good discussion at both. LHCONE/LHCOPN meeting report: https://indico.cern.ch/event/681168/attachments/1616425/2569199/LHCOPNE-20180307-Abingdon-meeting-report.pdf
Today was the 2nd HEPiX NFV WG meeting (https://indico.cern.ch/event/705126/). The next meeting is April 25 at 10 AM Eastern. Live notes at https://docs.google.com/document/d/1CTsAqioZY8pcCDf3S7GbObHD_Sic06BF15dPmaVjOcM/edit
Questions on these meetings?
I won't go into other networking details here unless there are questions. Next week at the OSG AHM meeting there are 4 talks on networking:
- USATLAS meeting: Network evolution (Shawn)
- Joint USATLAS/FIFE/USCMS meeting: perfSONAR discussion (Shawn)
- Tuesday afternoon: OSG Networking Analytics: Evolution and Status (Shawn / Ilija)
- Wednesday afternoon: OSG Networking (Shawn)
If you have questions (or specific things you think need covering in any of the above) bring it up now or email me.
-
13:45
→
13:50
XCache (5m). Speakers: Andrew Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
→
13:55
HPCs integration (5m). Speaker: Taylor Childers (Argonne National Laboratory (US))
Harvester deployment:
ALCF: The locally installed Rucio version was outdated enough to cause issues. Had to reinstall Harvester to get things consistent again. Back online, but needs work. Discussing with Doug G whether we should continue with dedicated tasks or grid-style running; each comes with its own benefits/drawbacks.
NERSC: Harvester up and running on Cori-P1/P2, processed 50M+ events over the past 7 days.
OLCF: Harvester now running for the Allocation jobs queue. Running 3 batch jobs at a time with 800 nodes each.
Container deployment:
NERSC: done
OLCF/ALCF: still in development.
-
13:55
→
14:30
Site Reports
-
13:55
BNL (5m). Speaker: Xin Zhao (Brookhaven National Laboratory (US))
- running fine in general
- a new xrootd version (release candidate) was put in place last week, which fixed an issue that caused xrootd to crash with core dumps. The official release will come later.
- new WNs have been in production for several weeks. The migration of the rest of the farm to SL7 will be combined with the installation of new top-of-rack switches, to minimize downtime.
-
14:00
AGLT2 (5m). Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
Four C6420 chassis and sleds are being racked today. We will configure them as SL7 WN as we get them up and ready to go.
All WN at MSU are now running SL7 as are 1/3 of the WN at UM. We are developing a plan to move the balance of the UM WN to SL7 by the end of March.
As of today, all of our dCache servers are dual IPv4/IPv6 stacked. We have not yet registered AAAA records though.
We have a network interruption between UM and MSU on Thursday night that will adversely impact HTCondor communications between our sites for up to 4 hours. Consequently, we will idle down all MSU WN starting later this afternoon so as to lose as few jobs as possible during the outage.
On Friday after the MSU WN set is back online, we will add SL7 Analysis and LMEM queues, rounding out the SL7 Panda Queue complement for AGLT2. When the complement of SL6 WN drops below some threshold, we will delete the SL6 Panda Queues and become SL7-only.
We are coordinating with the OSG folks on moving our non-ATLAS gatekeeper to SL7. This will most likely happen some time next week.
Singularity is installed on all WN as they are built, but no special configuration considerations have been implemented. Versions:
singularity-2.4.2-1.osg34.el7.x86_64
singularity-runtime-2.4.2-1.osg34.el7.x86_64
-
14:05
MWT2 (5m). Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))
Overall, the site is performing well and is full of jobs
Singularity upgraded to 2.4.2 on all workers
UC
- four of the twenty new C6420s online and running jobs
- remaining sixteen are built but still offline
- SPEC results are low (less than 50% of what they are expected to be)
- BIOS settings are consistent, appear to be correct
- has anyone else had this issue with this latest batch of workers?
IU
- still waiting on power
- work order is in, but timeframe is unknown
UIUC
- nothing new to report
-
14:10
NET2 (5m). Speaker: Prof. Saul Youssef (Boston University (US))
Just about ready to start "NET3", a joint Tier 3 with BU, Harvard and UMASS/Amherst.
Progress on the HTCondor migration with Brian Lin's help. Harvard has upgraded to OSG 3.4 with the new HTCondor. The problem has so far not reappeared. Setting up to do the same on the BU side.
Working on the LCMAPS and BeStMan migration (we're not worried about usatlas1,2,3,4 since they are all in the same unix group and group permissions are enough to do everything). We're planning to use Wei's gridftp-posix with a callout for Adler32 checksum computation.
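For context on the checksum callout: Adler32 can be computed in streaming fashion from the Python standard library, which is roughly what such a callout needs to do. A minimal sketch under that assumption (function name and chunk size are ours; this is not Wei's actual plugin):

```python
import zlib

def adler32_file(path, chunk_size=1 << 20):
    """Compute the Adler-32 checksum of a file, reading it in chunks."""
    value = 1  # Adler-32 initial value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            # zlib.adler32 accepts a running value, so large files
            # never need to be held in memory at once
            value = zlib.adler32(chunk, value)
    return "%08x" % (value & 0xFFFFFFFF)
```

Reading in fixed-size chunks keeps memory use constant regardless of file size, which matters for the multi-GB files a storage-element callout would typically see.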
Working on GPFS migration so that the system pool is on warrantied equipment.
Preparing for an initial NESE data lake deployment: ~12 PB raw, including substantial buy-in from Harvard.
Reminder: We're planning to migrate the NET2 storage endpoint into NESE.
Added Fermilab access for OSG jobs.
Sites consistently full with smooth operations.
Hoping for ESnet's help to restart our LHCONE peering.
SL7 transition is on the agenda.
-
14:15
SWT2-OU (5m). Speaker: Dr Horst Severini (University of Oklahoma (US))
- all OU sites working well
- still working on getting rucio to use READ_LAN and WRITE_LAN, in order to stage-in/out from internal xrootd directly. Working with Mario and Alexey on that
- Lucille is ready to be migrated from Lucille_SE to OU_OSCER_ATLAS_SE
- taking brief OSCER downtime this afternoon for RAM replacement and BIOS updates
-
14:20
SWT2-UTA (5m). Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
SWT2_CPB
- Seemingly solved an issue related to XRootD checksumming that was causing many problems.
- Major power outage when the utility feed burned up and the building generator failed. Both have been repaired.
- Delayed working on HTCondor while dealing with the above
UTA_SWT2
- Updated firmware in the Dell 4032 stack to avoid lockup issues
- Power outage at SWT2_CPB affected the network path for this cluster.
-
14:25
WT2 (5m). Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:30
→
14:35
AOB (5m)