US ATLAS Computing Integration and Operations

US/Eastern
virtual room (your office)

Description
Notes and other material available in the US ATLAS Integration Program Twiki
    • 13:00 - 13:15
      Top of the Meeting 15m
      Speaker: Robert William Gardner Jr (University of Chicago (US))
      • Apologies from Horst, Saul
      • Forthcoming facilities workshop in Clemson, https://indico.cern.ch/event/472826/
      • The week following Clemson there is a workshop on campus research HPC best practices that might be of interest, particularly for campus clusters: http://www.ncsa.illinois.edu/Conferences/ARCC/agenda.html

    • 13:15 - 13:25
      Capacity News: Procurements & Retirements 10m
    • 13:25 - 13:35
      Production 10m
      Speaker: Mark Sosebee (University of Texas at Arlington (US))
    • 13:35 - 13:40
      Data Management 5m
      Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    • 13:40 - 13:45
      Data transfers 5m
      Speaker: Hironori Ito (Brookhaven National Laboratory (US))
    • 13:45 - 13:50
      Networks 5m
      Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
    • 13:50 - 13:55
      FAX and Xrootd Caching 5m
      Speakers: Andrew Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))

      From Andy:

       

      What needs to change for async caching support:

      1)    XrdPosix package to add async POSIX style I/O

      2)    XrdPss package to use async POSIX style I/O

      3)    XrdOucCache package to provide an async cache interface; this also impacts the XrdPosix package, because XrdPosix is responsible for loading and using the caching interface.

       

      The issue here is that all of these interfaces are public, which means we need to implement this without breaking ABI compatibility (i.e., it must be backward compatible).
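
      As a rough sketch of the kind of change involved (class and method names below are illustrative only, not the actual XrdOucCache or XrdPosix interfaces), one ABI-preserving pattern is to leave the published synchronous interface untouched and add the callback-based entry point in an extended interface whose default implementation falls back to the blocking call:

      // Illustrative sketch only: these are not the real XrdOucCache classes.
      // The published base interface keeps its layout, so existing cache plugins
      // and callers stay binary compatible; async support lives in an extension.

      // Completion callback supplied by the caller (e.g. the POSIX layer).
      class CacheIOCB
      {
      public:
         // Called when the read completes; result is bytes read or -errno.
         virtual void Done(int result) = 0;

         virtual     ~CacheIOCB() {}
      };

      // Existing, published synchronous interface: its layout must not change.
      class CacheIO
      {
      public:
         virtual int  Read(char *buff, long long offs, int rlen) = 0;

         virtual     ~CacheIO() {}
      };

      // Extended interface carrying the new async entry point.  The default
      // implementation falls back to the blocking Read(), so old-style cache
      // implementations keep working without modification.
      class CacheIO2 : public CacheIO
      {
      public:
         using CacheIO::Read;   // keep the synchronous overload visible

         virtual void Read(CacheIOCB &iocb, char *buff, long long offs, int rlen)
                          {iocb.Done(Read(buff, offs, rlen));}

         virtual     ~CacheIO2() {}
      };

      Whether the actual xrootd changes take exactly this form is up to the developers; the sketch only illustrates how async support can be added without breaking the existing public ABI.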

       

      Time estimates:

      a)    1 week to design and code up the new caching interface (4/5/16).

      b)    2 weeks to retrofit XrdPosix package to use (a) (3/21/16).

      c)    1 week to retrofit XrdPss package to use (b) (3/25/16).

       

      The above will always be available as work proceeds in the pssasync branch of the xrootd GitHub repo, so other parallel work can proceed.  Please be aware that I go on vacation 3/28/16 for 12 days with limited, if any, internet connectivity, so it is likely that we will not have a production-quality version until 4/15/16 to 4/20/16, depending on how it goes.

    • 14:15 - 15:15
      Site Reports
      • 14:15
        BNL 5m
        Speaker: Michael Ernst
        • Smooth operations at full utilization of the compute farm (mostly MCORE)
        • AWS 100k core test still in preparation
          • Issues found with provisioning system based on APF
            • Now understood and fixed
          • Issues with S3 keys when running in 3 US regions
            • Understood and fixed by pilot developers
          • Scale test not to start before next week
        • Hiro has developed and deployed data management services for end users working on the shared T3 at BNL
          • Much improved bandwidth (over dq2-get) for data replication to T3 storage
        • Deployment of FY16 disk storage in progress
          • Hardware will be handed over to storage management group on or before March 15.
      • 14:20
        AGLT2 5m
        Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)

        pgsql was updated from 9.3.11 to 9.5.1 in advance of a dCache upgrade from the 2.10 to the 2.13 series.  This occurred during a full downtime on Tuesday of last week.  At the same time our WNs were completely rebuilt, updating Condor to 8.4.4, cvmfs to 2.1.20, the OSG-WN client to 3.3.8, and glibc to 2.12-1.166.el6_7.7, along with various other sl and sl-security updates.  Gatekeepers were updated to OSG 3.3.9, utilizing the OSG installation of Condor 8.4.3.  The master Condor machine is also on Condor 8.4.4, which works around a possible issue with the collector process in 8.4.3.

        Generally all upgrades went smoothly, modulo interactions between the various components.  The dCache update in particular surprised us with how quickly it went.  Several items were not immediately obvious, but a dCache documentation search showed the way.  The xrootd plugins required a bit more work, and consultations between Gerd, Ilija and Shawn will likely result in new plugin rpms in the near future.

        There are no outstanding issues with our site at this time.  However, we have noticed some recent jobs that are crashing WNs.  These jobs run a process called "JSAPrun.exe".  Condor will suddenly report jobs running this process with a LoadAv (as seen in condor_status) of many tens, or even many hundreds, which results in the WN either crashing or becoming unresponsive.  We then get hung_task_timeout dumps in /var/log/messages indicating processes that have been blocked for more than 120 seconds.  We have only just discovered this and have not yet had a chance to do any further digging, but we mention it here because other sites may also be seeing it.
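
        For sites that want a quick check for this pattern, the minimal sketch below (illustrative only, not an AGLT2 or Condor tool; the 4x-cores threshold is an arbitrary choice) compares the 1-minute load average from /proc/loadavg against the node's core count and flags anything far above it:

        // Illustrative sketch: flag a worker node whose 1-minute load average is
        // far beyond its core count, the "LoadAv of many tens or hundreds"
        // pattern described above.  Build with: g++ -o loadcheck loadcheck.cpp
        #include <cstdio>
        #include <unistd.h>

        int main()
        {
            double load1 = 0.0;
            FILE *f = std::fopen("/proc/loadavg", "r");
            if (!f || std::fscanf(f, "%lf", &load1) != 1) {
                std::perror("/proc/loadavg");
                if (f) std::fclose(f);
                return 2;
            }
            std::fclose(f);

            long cores = sysconf(_SC_NPROCESSORS_ONLN);   // online cores on this WN
            double threshold = 4.0 * cores;               // arbitrary "runaway" factor

            if (load1 > threshold) {
                std::printf("RUNAWAY? load1=%.1f cores=%ld threshold=%.1f\n",
                            load1, cores, threshold);
                return 1;   // non-zero exit so a cron or monitoring wrapper can alert
            }
            std::printf("OK load1=%.1f cores=%ld\n", load1, cores);
            return 0;
        }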

         

      • 14:25
        MWT2 5m
        Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))

        Site has been running well, except for IU

        • Networking problems at IU
        • Last week the issues were in the Indiana GigaPoP, related to the ESnet conversion
        • Down again as of last night, tickets pending
        • Condor pool offline

         

        Scan for the latest OpenSSL bug (the DROWN attack) shows MWT2 clean

         

        Minor update of dCache to 2.10.56-1

        • Helped with some XrootD door issues
        • Removed an old monitoring plugin that was causing java null pointer exceptions
        • Still some issues; Lincoln is following up with Gerd

         

        New Disk at UChicago

        • Still in process of migrating LOCALGROUPDISK to Ceph
        • Migrating user data from older Ceph system to new Ceph (many tiny files).
        • Servers will be converted to dCache (~350TB)

         

        OSG 3.3.9

        • New lsm-get in use, removing the need for DCAP at MWT2
        • Reports to Elastic Search
        • Will be switching compute nodes to OSG 3.3.9 wn client

         

        minRSS and maxRSS now set

        • New Panda Queues for HIMEM (see the sketch after this list)
          • MWT2_HIMEM (2G-5G) - only nodes with >= 5GB/core
          • MWT2_HIMEM_MCORE (2G-3G) - only nodes with >= 3GB/core
        • ANALY_MWT2_MCORE (cpus=8, maxrss 16GB)
          • Very busy with jobs
          • But users do not use all cores
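
        A purely illustrative sketch (this is not PanDA or AGIS brokerage code; only the memory windows quoted above come from the configuration) of how a job's requested RSS maps onto the new HIMEM queues:

        #include <cstdio>
        #include <string>

        // Illustrative only: encode the minRSS/maxRSS windows quoted above to show
        // how a job's requested RSS selects a HIMEM queue.  Real routing is done by
        // PanDA against the configured queue parameters, not by code like this.
        std::string himemQueueFor(double rssGB, bool multicore)
        {
            if (multicore) {
                // MWT2_HIMEM_MCORE: 2-3 GB jobs, only on nodes with >= 3 GB/core
                if (rssGB > 2.0 && rssGB <= 3.0) return "MWT2_HIMEM_MCORE";
            } else {
                // MWT2_HIMEM: 2-5 GB jobs, only on nodes with >= 5 GB/core
                if (rssGB > 2.0 && rssGB <= 5.0) return "MWT2_HIMEM";
            }
            return "";   // not a HIMEM job: handled by the existing queues
        }

        int main()
        {
            std::printf("%s\n", himemQueueFor(4.0, false).c_str());  // MWT2_HIMEM
            std::printf("%s\n", himemQueueFor(2.5, true).c_str());   // MWT2_HIMEM_MCORE
            return 0;
        }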

         

        ATLAS Analytics

        • Now keeping 1 copy of the data at Clemson, 1 copy at UC with redundant head nodes.
        • Currently riding out a scheduled downtime at Clemson. Kibana was up but now seems to be down.
        • Users have been notified.

         

        misc

        • Cleaning Nagios cruft and converting to Icinga.
        • Building SL7 machines and puppet rules for non-critical services.
        • No plans to run OSG software on SL7 for now.
      • 14:30
        NET2 5m
        Speaker: Prof. Saul Youssef (Boston University (US))
      • 14:35
        SWT2-OU 5m
        Speaker: Dr Horst Severini (University of Oklahoma (US))

        - All sites running fine, no problems

        - Work ongoing to get OSG installed on new OSCER cluster. Currently working through some SELinux issues.

      • 14:40
        SWT2-UTA 5m
        Speaker: Patrick Mcguigan (University of Texas at Arlington (US))

        UTA_SWT2

        Facility electrical work forced a shutdown over the weekend.  During the shutdown we added memory to the nodes that had 24GB of memory.

        • Provides ~320 additional single job slots or ~80 additional multi-core slots
        • Doubles multicore capacity

        SWT2_CPB

        Bringing 400TB of storage online.

         

        UTA - Expecting network interruption this weekend.

      • 14:45
        WT2 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))

        Putting SMR-based storage (~1PB usable) in service, to be added as needed.  Moving selected data to SMR storage (a small selection sketch follows the list):

        • large files (1GB+) that haven't been accessed for 2+ years.
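
        A minimal selection pass along these lines (illustrative only, not a SLAC/WT2 tool; the scan root is a placeholder and the filesystem must be maintaining atime) would list candidates by size and last access time:

        // Illustrative sketch: list regular files >= 1 GB whose last access time is
        // more than ~2 years ago, i.e. candidates for migration to SMR storage.
        // Build with: g++ -std=c++17 -o smr_scan smr_scan.cpp
        #include <cstdint>
        #include <cstdio>
        #include <ctime>
        #include <filesystem>
        #include <sys/stat.h>

        namespace fs = std::filesystem;

        int main(int argc, char *argv[])
        {
            const char *root = (argc > 1) ? argv[1] : ".";            // directory tree to scan
            const std::uintmax_t minSize = 1024ULL * 1024 * 1024;     // 1 GB
            const std::time_t twoYears = 2LL * 365 * 24 * 3600;       // ~2 years, in seconds
            const std::time_t now = std::time(nullptr);

            for (const auto &entry : fs::recursive_directory_iterator(
                     root, fs::directory_options::skip_permission_denied)) {
                if (!entry.is_regular_file()) continue;
                if (entry.file_size() < minSize) continue;

                struct stat st;                                       // need atime, so use stat()
                if (stat(entry.path().c_str(), &st) != 0) continue;
                if (now - st.st_atime > twoYears)
                    std::printf("%s\n", entry.path().c_str());        // migration candidate
            }
            return 0;
        }

        In practice the candidate list would more likely come from the storage system's own metadata than from a filesystem walk, but the selection criteria are the ones quoted above.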

        Putting batch nodes in service via OpenStack. Delayed due to SLAC computing center personnel change.

    • 15:15 - 15:20
      AOB 5m