US ATLAS Computing Integration and Operations

Name: US ATLAS Computing Integration and Operations
Start: 2017-03-29T13:00:00-04:00
End: 2017-03-29T16:10:00-04:00
Location: No location set

Wednesday 29 Mar 2017, 13:00 → 16:10 US/Eastern

Description

Notes and other material available in the US ATLAS Integration Program Twiki

- 13:00 → 13:15
  
  Top of the Meeting 15m
  
  Speakers: Eric Christian Lancon (BNL), Robert William Gardner Jr (University of Chicago (US))
  
  Introduction
- 13:15 → 13:20
  
  ADC news and issues 5m
  
  Speakers: Robert Ball (University of Michigan (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
  
  The OSG BDII is shutting down, and as of yesterday the ATLAS SAM tests are switching the way in which they select queues to test. From the announcement that was sent out by Ryu Sawada:
  
  "The ATLAS SAM tests are going to change the way they select the queues for the SAM tests. The selection so far was done using BDII information except for HTCONDOR-CEs. Soon it will be done selecting from the queues that are effectively used, i.e. the queues attached to the PandaQueues in AGIS and a new flag ETF_default=1."
  
  "No negative impact is expected. But please watch SAM results of your site, and if you find any false results, please contact us for the correction by sending a ticket to GGUS."
  
  https://wlcg-mon.cern.ch/dashboard/request.py/siteviewhome
  
  JSON reporting of space usage is now active for all US sites.
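
As a rough illustration of the queue selection described in the SAM announcement above, the sketch below filters a list of queue records down to those that are attached to at least one PandaQueue and carry `ETF_default=1`. Only that rule comes from the announcement; reading "and" as requiring both conditions is an assumption, and the record layout (`name`, `panda_queues`, `etf_default`) and the sample queue names are hypothetical, not the actual AGIS schema.

```python
from typing import Dict, List


def select_sam_queues(queues: List[Dict]) -> List[str]:
    """Return names of queues that the new SAM selection would pick.

    Assumed (hypothetical) record layout, not the real AGIS schema:
      name          -- CE queue name
      panda_queues  -- list of PandaQueues the CE queue is attached to
      etf_default   -- 1 if the ETF_default flag is set, else 0
    """
    selected = []
    for queue in queues:
        attached = bool(queue.get("panda_queues"))
        flagged = queue.get("etf_default") == 1
        # Assumption: both conditions must hold for a queue to be tested.
        if attached and flagged:
            selected.append(queue["name"])
    return selected


# Made-up sample data, for illustration only.
sample_queues = [
    {"name": "ce01-default", "panda_queues": ["SITE_PROD"], "etf_default": 1},
    {"name": "ce01-legacy", "panda_queues": [], "etf_default": 1},
    {"name": "ce02-default", "panda_queues": ["SITE_ANALY"], "etf_default": 0},
]

if __name__ == "__main__":
    print(select_sam_queues(sample_queues))
```

A site could use a check of this kind to spot queues that are in use but would be skipped by the tests because the flag is missing.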
- 13:20 → 13:30
  
  Production 10m
  
  Speaker: Mark Sosebee (University of Texas at Arlington (US))
  
  shift-summary-3_22_17.pdf
  
  shift-summary-3_29_17.pdf
- 13:30 → 13:35
  
  Data Management 5m
  
  Speaker: Armen Vartapetian (University of Texas at Arlington (US))
- 13:35 → 13:40
  
  Data transfers 5m
  
  Speaker: Hironori Ito (Brookhaven National Laboratory (US))
- 13:40 → 13:45
  
  Networks 5m
  
  Speaker: Dr Shawn McKee (University of Michigan ATLAS Group)
  
  perfSONAR v4.0 RC3 is out. Hoping this is basically the final version for v4.0. If no problems are found, a release could happen in a couple of weeks. Will need to get our sites updated (auto-updates should run, but this needs checking).
  
  Working on a new mesh-configuration. See https://meshconfig-itb.grid.iu.edu/. It will become the production version, http://meshconfig.grid.iu.edu, soon (within about a week). Everyone can get an account if interested. Admin access for specific meshes must be requested if needed.
  
  Lots of reorganization of network service components planned in OSG. Will remove some ITB instances and rebalance resources (memory/CPU). New monitoring will be Docker-based ETF running on a CentOS 7.3 VM: https://gitlab.cern.ch/etf/docker/blob/master/README.md. All services will need updates once perfSONAR v4.0 is released.
  
  Next week is the LHCONE/LHCOPN meeting at BNL. Hope some of you will be attending. https://indico.cern.ch/event/581520/
  
  Analytics on network metrics are showing occasional packet loss problems at various locations. Need to start opening tickets (after perfSONAR v4).
  
  Analytics links:
  
  http://tiny.cc/PktLossNoUnknown (Shows 6 months of packet loss by src/dest)
  
  http://tiny.cc/pSLink (Shows network stats by specific site)
  
  Test emails by subscription are being issued, e.g.:
  
  Dear Shawn McKee,
  
  this mail is to let you that there was a significant change in packet loss detected by PerfSONAR.
  
  The site CA-SCINET-T2 (142.150.19.61)'s links got improved, total number from 5 to 0 links.
  These are all the bad links for the past hour:
  
  Best regards,
  ATLAS AAS
  
  Comments from Rob: Improve the email messages to make what is being communicated obvious.
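
One way to act on Rob's comment is to have the message state the direction of the change, the before and after counts, and the affected links explicitly. The sketch below is only an illustration of that idea: the function name, its parameters and the wording of the message are hypothetical and are not taken from the actual alerting service.

```python
from typing import List


def format_packet_loss_alert(site: str, address: str, before: int,
                             after: int, bad_links: List[str]) -> str:
    """Build a plain-text alert stating what changed and what remains bad."""
    if after < before:
        trend = "improved"
    elif after > before:
        trend = "degraded"
    else:
        trend = "unchanged"
    lines = [
        f"Packet loss status for {site} ({address}): {trend}.",
        f"Links with significant packet loss: {before} -> {after}.",
    ]
    if bad_links:
        lines.append("Links still affected in the past hour:")
        lines.extend(f"  - {link}" for link in bad_links)
    else:
        # Say so explicitly instead of printing an empty list.
        lines.append("No links were affected in the past hour.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_packet_loss_alert("CA-SCINET-T2", "142.150.19.61", 5, 0, []))
```

With the values from the sample email (5 bad links going to 0), the empty list of bad links is replaced by an explicit statement that nothing is currently affected.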
- 13:45 → 13:50
  
  FAX and Xrootd Caching 5m
  
  Speakers: Andrew Bohdan Hanushevsky (SLAC National Accelerator Laboratory (US)), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
  
  0. Xrootd proxy cache server at AGLT2.
  
  1. Under heavy load, the xrootd proxy cache sometimes can't send data back to some clients (broken pipe/send failure). Currently focusing on checking OS/networking settings. Increasing "txqueuelen" on the NIC (ens2) from 1000 to 20000 did not help. Reviewing other parameters.
  
  2. Question about uncommitted data in memory when a client closes its connection. Committing the data to disk is preferred, to increase proxy efficiency, but that is not always possible under heavy load. In those cases the data will be discarded.
  
  3. Occasional loss of file descriptors (including TCP). 22 files so far in the last two days of stress testing (out of 224k). _May_ be due to a Linux kernel semaphore bug which is fixed in the latest kernel. Need to confirm.
  
  4. After 1. is understood, will enter a long period of stress testing to check stability, memory usage, file/TCP descriptors, and networking.
  
  5. Packaging as a product.
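
For the descriptor checks mentioned in items 3 and 4, a minimal sketch of how the number of open descriptors of a process could be sampled on Linux is given below. It relies on the standard procfs layout (`/proc/<pid>/fd`); the function names, the `proc_root` parameter and the growth heuristic are assumptions made for illustration and do not describe how the stress test is actually run.

```python
import os
from typing import List


def count_open_fds(pid: int, proc_root: str = "/proc") -> int:
    """Count entries in <proc_root>/<pid>/fd (Linux procfs layout).

    When a process inspects itself, the count includes the descriptor
    that is briefly opened to list the directory.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def grows_monotonically(samples: List[int], min_growth: int = 1) -> bool:
    """Crude leak hint: True if the samples never decrease and the total
    increase from first to last sample is at least min_growth."""
    if len(samples) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_drops and (samples[-1] - samples[0]) >= min_growth
```

Sampling `count_open_fds` for the proxy process at fixed intervals during a long run and passing the resulting list to `grows_monotonically` would give a first indication of whether descriptors are being lost; reading another user's `/proc/<pid>/fd` generally requires matching ownership or root privileges.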
- 13:50 → 14:10
  
  Site movers 20m
  
  Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  
  Ilija will send instructions on how to set up internal Xrootd doors for jobs to access sites' local storage.
- 14:10 → 14:30
  
  OS performances testing 20m
  
  Speaker: Doug Benjamin (Duke University (US))
  
  Doug
  
  Charge usage - ALCC - 5,527,098 hours
  
  ERCAP - 2,197,078 hours
- 14:30 → 16:05
  Site Reports
  - 14:30
    BNL 5m
    
    Speaker: Xin Zhao (Brookhaven National Laboratory (US))
    
    Site is in downtime now.
    
    dCache upgrade ongoing
    
    software version : 2.10 to 3.0.11
    
    network : channel bonding on dCache control nodes, for fault tolerance and load balance; upgrade of switch software
    
    Issues with AGIS PanDA queue blacklisting system
    
    resulted in loss of CPU cycles
    
    bugs with regard to downtime cancellation and manual online operation, fixed
    
    policy of switcher: when to drain a site before a downtime. ADC will revisit current policy and present to sites.
  - 14:35
    
    AGLT2 5m
    
    Speakers: Robert Ball (University of Michigan (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
    
    We have updated to dCache 3.x series from 2.16. There is a DB schema change that took 5 hours to complete. Unfortunately, our monthly chimera dumps are now broken as the schema change broke chimera_find.sh. Hiro promises that he can fix this, and there is also a dCache ticket open for it.
    
    Our gatekeepers are updated to OSG 3.3.21 now, and the new [Resource Entry xxx] sections are in place in the 30-gip.ini file. Following directions posted by Wei and John, AGIS was also updated to connect the listed queues.
    
    We have been notified that there will be a complete power outage in the UM server room on Saturday, June 24. We plan to shut down all services on Friday afternoon, June 23, to prepare for this. Hopefully we can get much of the site back up on Saturday afternoon, but that is far from certain this far in advance.
  - 14:40
    MWT2 5m
    
    Speakers: David Lesny (Univ. Illinois at Urbana-Champaign (US)), Lincoln Bryant (University of Chicago (US))
    
    Site is full of jobs - operating well
    
    OSG 3.3.22 installed on all gatekeepers
    
    Ready for AGIS reporting and BDII retirement
    
    HTCondor 8.4.11 installed across the site
    
    CVMFS 2.3.3 fully deployed (2.3.5 to be released soon)
    
    USERDISK decommissioning
    
    SCRATCHDISK increased to 300TB
    
    Waiting on ADC to change Panda Q to use SCRATCHDISK for output
    
    New switches at UChicago are fully deployed
    
    Cisco 6509 has been retired - all nodes moved to new Junipers
    
    In the future, will replace the 8x10Gb connection to SciDMZ with 2x40Gb
    
    Network monitoring and other issues
    
    Full access to all switch data with SNMP at UChicago and Illinois
    
    Working on the same at Indiana
    
    Monitoring all port connections
    
    A 2x10Gb uplink at Indiana was degraded to 1x10Gb (fixed)
  - 14:45
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
    
    Smooth operations with full sites, with the following exceptions:
    
    1) Checksum mismatch errors. This generated a ticket for us, but the problem was on the source end. Details can be found here http://egg.bu.edu/NET2%7binf:NET2%7d/gadget:Studies/section:report/2017-03/checksum_mismatch_exotics/
    
    2) ATLASSCRATCHDISK space is being used.
    
    3) Deletions are still happening via Bestman at our site.
    
    4) We still have a mystery problem with HTCONDOR-CE where the site drains for reasons that are not yet understood. We're still investigating and have been in contact with Brian.
    
    5) Working intensively on NESE, MGHPCC floor and WAN networking. Had a very useful meeting with Alastair Dewhurst re: CEPH/Gridftp and his "Echo" project.
  - 14:50
    
    SWT2-OU 5m
    
    Speaker: Dr Horst Severini (University of Oklahoma (US))
    
    - sites are mostly running well
    
    - occasionally held HTCondor-CE jobs on OU_OSCER_ATLAS; potentially related to internal OSCER authentication issues; following up
  - 14:55
    
    SWT2-UTA 5m
    
    Speaker: Patrick Mcguigan (University of Texas at Arlington (US))
    
    UTA_SWT2
    
    Planning on updating the hardware soon.
    
    No production issues
    
    SWT2_CPB
    
    Had an issue with transfers from two Canadian sites (McGill, UToronto) due to asymmetric routing. CANARIE discovered the misconfigured router and fixed it.
    
    An issue with space reporting exists. One data server had a configuration issue and was reporting more space in use than was physically on disk. This has been resolved, and we will see how much the overall space reporting was affected.
  - 15:00
    
    WT2 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
- 16:05 → 16:10
  
  AOB 5m