US ATLAS Computing Integration and Operations

US/Eastern
Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 13:05
      Top of the Meeting 5m
      Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
    • 13:15 13:20
      ADC news and issues 5m
      Speakers: Robert Ball (University of Michigan (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • AGIS cleanup of the "Associated CE queues" section of PQs, requested by Tadashi for Harvester and applicable to other pilot submission channels as well:
        • only production-functional queues should be listed there
    • 13:20 13:25
      OSG software issues 5m
      Speaker: Brian Lin (University of Wisconsin)
    • 13:25 13:30
      Production 5m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:30 13:35
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:35 13:40
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:40 13:45
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)

      Today is the OSG area coordinators meeting focusing on networking.  All are welcome to join: https://opensciencegrid.github.io/management/area-coordinators/

      We are getting the word out on two items related to perfSONAR:

      1.  Please update your instances to CentOS 7 ASAP
      2. OSG networking services will be migrating from grid.iu.edu over to opensciencegrid.org (and moving from GOC to AGLT2) by May 31, 2018

      The next HEPiX NFV working group meeting is coming up April 25:  https://indico.cern.ch/event/715631/ 


    • 13:45 13:50
      XCache, R&D for the Data Delivery Layer 5m
      Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Andrew Hanushevsky, Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
    • 13:50 13:55
      HPCs integration 5m
      Speaker: Taylor Childers (Argonne National Laboratory (US))
    • 13:55 14:30
      Site Reports
      • 13:55
        BNL 5m
        Speaker: Xin Zhao (Brookhaven National Laboratory (US))
        • production
          • fluctuation in the number of running job slots at the beginning of the week, due to a central service issue (expired certificate)
        • overlay option disabled in singularity on all WNs, in light of the reported vulnerability
        • migration to SL7 ongoing; newly purchased WNs are on SL7 (with a singularity SL6 image). The rest will be migrated together with the ToR switch migration, to minimize downtime.
        • dCache pool nodes upgraded from 3.0.11 to 3.0.43, to fix a login handshake problem with newer xrootd clients (> 4.7.0)
        • transition to the LCMAPS VOMS plugin ongoing on the CE side (in rolling fashion). The SE (dCache) is almost done (only a corner case, grid-proxy-init users, is not covered).
        • BNL is in the process of creating a dump of the namespace. A program is currently running to fill an additional database column with the full path, which will be used to create the dump. BNL currently has about 130M files, and filling the column is going a bit slower than 100 Hz, so it will take three weeks or so. But once the table is filled and kept up to date, creating the dump will be much faster.
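The multi-week estimate above is simple rate arithmetic; a quick sketch (the 90 Hz fill rate is an illustrative stand-in for "a bit slower than 100 Hz"):

```python
# Rough ETA for filling the full-path column at BNL.
# 130M files at "a bit slower than 100 Hz" -- 90 Hz is an assumed stand-in.
n_files = 130_000_000
fill_rate_hz = 90  # rows filled per second (assumption)

seconds = n_files / fill_rate_hz
days = seconds / 86_400
weeks = days / 7
print(f"{days:.1f} days ~= {weeks:.1f} weeks")  # -> 16.7 days ~= 2.4 weeks
```

At rates somewhat below 90 Hz this stretches toward the quoted three weeks.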
      • 14:00
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        C6420 chassis and R6420 sleds (56 HT cores) have been delivered to UM and MSU.  7 sleds are currently online and running HTCondor jobs, and 12 additional sleds should be configured by COB Friday.  The balance awaits the installation of a 10Gb switch at MSU.

        These C6420s are current hogs.  See the attached plots.

        The UM site has a coolant leak that will be addressed tomorrow.  Unfortunately, all cooling must be shut down to find the leak, so today we began idling down the WNs in order to power them off, minimizing the room heat load.  Service VMs and machines, along with dCache storage, will remain online, hopefully within the cooling available from a portable unit that will be put in place during the repairs.  The repairs are expected to be completed by COB tomorrow.

        Approximately 2/3 of our WN total are now running SL7.  The last 1/3 will transition more slowly as operation of the muon-calibration center needs to be carefully transitioned.

        Our space reporting via Ruby script transitioned to SL7 on Friday and broke, as the Ruby ActiveRecord version jumped by two major versions, from 2.3.18 to 4.2.6.  This was fixed by Shawn over the weekend, and our space reporting is once again up to date.  The Wiki page below has been updated with the changed code.
        https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DcacheSpaceReportingJsonViaRuby

        The xrootd problem reported a few days ago and summarized at this URL
        https://github.com/dCache/dcache/pull/3562
        has not been observed here, to the best of my knowledge.  We are running dCache 4.0.5 with a mix of xrootd RPMs, depending on whether the WN is SL6 (4.6.1) or SL7 (4.8.1).

        AGLT2 now has a full suite of SL7 PandaQueues in operation.  SL6 PQs are also still enabled, and will remain so until the last SL6 WN is retired.

        To quote Xin, "overlay option disabled in singularity on all WNs, in light of the reported vulnerability".


      • 14:05
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US))

        MWT2 Update

        • UC
          • C6420 deployment: all 20 new workers are now online and in production
          • Elasticsearch: received new hardware to expand ES cluster
          • dCache: updated to 3.1.32, applied LCMAPS configuration change
          • Gatekeepers: reconfigured for LCMAPS
        • IU
          • C6420 deployment: working on getting the first C6420 built 
        • UIUC
          • GPFS stale file handle on mwt2-gk and workers

        Stampede2 Update

        • On Saturday/Sunday (3/24-3/26) Stampede2 was nearly empty
        • Stampede PanDA Qs had many activated jobs available
          • CONNECT_STAMPEDE_MCORE
          • CONNECT_ES_STAMPEDE_MCORE
        • Started a large run on S2
          • 640 nodes (out of 1680 nodes)
          • Peaked at 30720 cores (out of 80640 cores)
          • 3840 simultaneous jobs
          • Standard and ES MCORE jobs (simulation)
          • Processed 70M events with 85K jobs over about 48 hours
          • 90% efficiency
        • Steady state operation will be much lower
          • 300 nodes with 14400 cores (1800 jobs)
          • Run started yesterday (4/10-4/11)
          • 9M events, 10.5K jobs
          • 90% efficiency
          • Asked by TACC admins to limit the number of jobs due to potential I/O load issues
        • All PandaQs are now tagged with resource type HPC
        • Using a Singularity image based on the one in the ATLAS repository
        • LSM with pCache gets an 85% hit rate on stage-in files
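The large-run figures above are mutually consistent; a quick cross-check of the implied per-job throughput (all input numbers taken from the report, derived values rounded):

```python
# Cross-check of the reported Stampede2 large run.
events = 70_000_000   # events processed
jobs = 85_000         # jobs completed
hours = 48            # wall-clock duration of the run
slots = 3840          # simultaneous jobs at peak

events_per_job = events / jobs            # ~823 events per job
avg_job_hours = (slots * hours) / jobs    # ~2.2 hours per job if fully packed
event_rate_hz = events / (hours * 3600)   # ~405 events/s aggregate
print(events_per_job, avg_job_hours, event_rate_hz)
```

So the run averaged roughly 800 events per job at about two hours each, i.e. an aggregate rate of about 400 Hz across the 3840 slots.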
      • 14:10
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))


        NET3 "Northeast Tier 3" successfully launched.  38 users from BU, Harvard, and UMASS/Amherst.  

        BU+MIT+ESnet are working on re-establishing our LHCONE peering.  The current theory is that a bad card at MANLAN is causing the problem.  We have been jumping on and off LHCONE over the past week for testing.

        Working on finishing the LCMAPS migration (with Brian Lin helping), turning off GRAM at BU, and migrating from Bestman to Wei's GridFTP with an Adler32 callout.
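An Adler32 checksum of the kind such a callout reports can be reproduced with Python's zlib module. This is a generic sketch (function name and chunk size are illustrative), not the actual callout code:

```python
import zlib

def adler32_of_file(path, chunk_size=1 << 20):
    """Compute the Adler32 checksum of a file, reading in 1 MiB chunks."""
    checksum = 1  # Adler32 is defined to start from 1, not 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            checksum = zlib.adler32(chunk, checksum)
    # Report as 8 zero-padded hex digits, the form stored in the Rucio catalog
    return f"{checksum & 0xFFFFFFFF:08x}"
```

Zero-padding matters: a checksum whose leading byte is zero must still be reported as 8 hex digits, or it will mismatch the catalog value.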

        Lots of NESE activity: the first NESE deployment has been ordered (10.8 PB).  Its first major test will be an ATLAS storage endpoint.

        Production is smooth, with a noticeable switchover from mcore domination to production domination.

        Starting to plan the RH7 migration.

        Annual MGHPCC maintenance down day is May 22.  

      • 14:15
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - all sites running well

        - Lucille is having A/C issues, though, so running at reduced capacity

        - still waiting for the Rucio server update in order to take advantage of the local xrootd redirector for jobs running on OSCER

        - singularity tests currently on hold because we can't run without overlay turned on


      • 14:20
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        UTA_SWT2:

        • Finished updating storage at the site; now at 425 TB
        • Walked a rebuild through the cluster to fix a Rocks issue
        • Fixing an issue with SAM.

        SWT2_CPB:

        • HTCondor transition now at 3,000 cores; scaling issues have appeared and are being investigated
        • Seeing a potential XRootD problem when checksumming long filenames
        • SWT2_CPB_GROUPDISK has been retired.

        General:

        • CPU time of jobs in Torque seems to be wrong: startup/shutdown is accounted for, but the main processing time is not seen.
      • 14:25
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
    • 14:30 14:35
      AOB 5m